Talk:Policy Engine Proposal

From Lustre Wiki

Hi Jxiong, there are a large number of distinct items proposed here (MOI, SLC, changelog improvements, coordinator improvements, a new rules engine, internal (kernel-side) data copy, master/slave job distribution). It seems to me that these could be implemented separately and incrementally, deriving much of the value without so much up-front implementation effort. In particular, we at Cray have been thinking about a Lustre Fast Find, which is very similar to your MOI but generalized. It is very useful without any of the other HSM improvements.

Summary

Improve Lustre's 'lfs find' capability to run server-side, backed by new indices, for fast MD-based searches (an alternative to tree walking).

Background

One of the major reasons for implementing external MD databases (e.g. Robinhood) or inode scanning tools (e.g. LiPE, ne2scan) is to quickly find the set of files matching particular search criteria, without having to walk the entire FS namespace and stat every file. Searches must efficiently filter the entire set of files down to a relatively small, actionable list, i.e. one that will be used to drive a policy (e.g. archive these files). If Lustre had an efficient 'lfs find' capability, it could generally take the place of other scanning/walking/DB-generating utilities, allowing much broader usage.

Description

Modify the lfs find command to return a filtered bulk list from the MDS, rather than performing a client-side userspace tree walk. A new find RPC is issued to the MDS; the MDS executes the search over the file metadata to generate the filtered list, and the results are returned to the client.

Search criteria

Search criteria should include any metadata stored on the MDS (e.g. user, xattr, OST number). For MD not stored on the MDS, approximations are required: lazy SoM (LUS-1772), lazy atime, etc. Multiple filters can be used in combination (e.g. larger than 1M and mtime > 1 day). For efficient filtering, the ordering of criteria is important. Common search criteria are likely to be modification time, access time, and rough size. Users might also want the ability to label files and search by label.
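
Combining size and time criteria can be sketched with today's client-side syntax. Since `lfs` only runs against a live Lustre filesystem, this sketch uses GNU find on a scratch directory as a stand-in (the hypothetical `lfs` invocation is in the comment; `/lustre` is a placeholder path):

```shell
# Stand-in demo of combined-criteria filtering.  The proposed call
# would be:  lfs find /lustre -type f -size +1M -mtime +1
d=$(mktemp -d)
truncate -s 2M "$d/big_old"     # >1M, old mtime: should match
truncate -s 2M "$d/big_new"     # >1M, fresh mtime: filtered out
truncate -s 1K "$d/small_old"   # <1M, old mtime: filtered out
touch -t 202001010000 "$d/big_old" "$d/small_old"

# Both predicates must hold: larger than 1M AND older than 1 day.
out=$(find "$d" -type f -size +1M -mtime +1)
echo "$out"
```

Only `big_old` survives both filters, which is why criteria ordering matters for efficiency: the most selective predicate should be evaluated first.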

Index creation

The MDS already maintains key-value indices (e.g. the OI index, which maps FID to inode) via the dt_index_operations functionality of the OSD API; all backends support this. For Fast Find, we can create extra indices that sort files by the search criteria of interest. Every modifying RPC would then include updates to each index. The indices allow efficient find-list generation: a search for files with mtime < 1 day could quickly scan the sorted mtime index for the range of matching files.
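
The payoff of a sorted index is that a range query becomes a contiguous scan rather than a per-file stat of the whole tree. A toy model, with a sorted text file standing in for a dt_index (the real index would live in the OSD, not in userspace):

```shell
# Toy model of a sorted mtime index: "key value" lines sorted by key.
d=$(mktemp -d)
echo data > "$d/recent"
echo data > "$d/stale"
touch -t 202001010000 "$d/stale"

# Build the index once (a real MDS would update it on every modifying RPC).
find "$d" -type f -printf '%T@ %p\n' | sort -n > "$d.idx"

# Range query: entries older than the cutoff form a contiguous prefix
# of the sorted index, so no full namespace walk is needed.
cutoff=$(date -d '1 day ago' +%s)
old=$(awk -v c="$cutoff" '$1 < c {print $2}' "$d.idx")
echo "$old"
```

The "update on every modifying RPC" step is exactly where the transaction-cost concern in the Resources section comes from.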

List return format

The resulting file list should likely include all available MD for each file, in addition to the file names, to avoid having to issue subsequent stats. Given a potentially large number of items in the returned list, transferring it may be tricky. One possibility is for the MDS to store the list persistently as a regular Lustre DoM file, attached to the namespace under /.lustre/. The userspace caller would receive a reference to this file, which it could then parse at will.
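
The record format and the parse-at-will flow can be sketched as follows. A plain temp file stands in for the proposed DoM result file, and GNU find emits "name plus cached metadata" records so the caller never needs a follow-up stat; the field layout is purely illustrative:

```shell
d=$(mktemp -d)
truncate -s 2M "$d/a"
truncate -s 3M "$d/b"
truncate -s 1K "$d/c"
result=$(mktemp)   # stand-in for a DoM file under /.lustre/

# "Server side": one record per match -- path, size, mtime -- so the
# caller gets the metadata along with the name.
find "$d" -type f -size +1M -printf '%p %s %T@\n' | sort > "$result"

# "Client side": parse the returned list at its own pace.
while read -r path size mtime; do
    echo "match: $path ($size bytes)"
done < "$result"
```

Storing the result as a regular file also means the caller can re-read, grep, or stream it without holding an RPC open.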

Resources

Every index update will require additional transaction blocks in the tx, plus a write operation. These should be fast on a flash MDS, but updating many indices will be expensive, so it will be best to update only the indices that matter to the user; i.e. they should be user-selectable. More importantly, every find operation will impose a potentially disk-heavy search on the MDS, consuming CPU and IOPS. This is of course less total load than a namespace walk, but it is executed on a critical FS component.

DNE

Each DNE metadata server would run the same search in parallel over its own subset of files. This gives us horizontally scalable searches (which RBH cannot achieve).
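
The scatter/gather shape of a DNE search can be sketched with directories standing in for MDTs: each "server" filters its own files concurrently, and the client merges the partial lists.

```shell
# Two directories stand in for two DNE MDTs.
d=$(mktemp -d)
mkdir "$d/mdt0" "$d/mdt1"
truncate -s 2M "$d/mdt0/x"
truncate -s 2M "$d/mdt1/y"
truncate -s 1K "$d/mdt0/z"

# Scatter: each "MDT" runs the same filter in parallel.
for mdt in "$d/mdt0" "$d/mdt1"; do
    find "$mdt" -type f -size +1M > "$mdt.out" &
done
wait

# Gather: the client merges the per-server result lists.
merged=$(cat "$d/mdt0.out" "$d/mdt1.out" | sort)
echo "$merged"
```

Because each MDT only scans its own indices, adding MDTs adds search throughput, which is the horizontal scaling claimed above.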

Services based on find

Services such as a purger, archiver, or tier manager could be built on find with extremely simple pipeline scripts, triggered by cron or a job scheduler. For example, a daily cron job running lfs find /lustre -size +20M -mtime -1 | xargs lfs hsm_archive would archive all files larger than 20MB modified within the last day. Slightly more complex services could take external information into account: Lustre free space, archive space, system load, etc.
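
The archiver pipeline above can be sketched end to end. Since neither `lfs find` nor HSM is available outside Lustre, GNU find stands in for the server-side search and a copy into an archive directory stands in for `lfs hsm_archive` (the intended cron line is in the comment; all paths are placeholders):

```shell
# Intended daily cron pipeline (hypothetical):
#   lfs find /lustre -type f -size +20M -mtime -1 -print0 \
#       | xargs -0 -r lfs hsm_archive
src=$(mktemp -d)   # stand-in for /lustre
arc=$(mktemp -d)   # stand-in for the HSM archive
truncate -s 21M "$src/new_big"    # >20M, modified today: archived
truncate -s 1M  "$src/new_small"  # too small: skipped

# -print0 | xargs -0 keeps the pipeline safe for odd filenames.
find "$src" -type f -size +20M -mtime -1 -print0 |
    xargs -0 -r -I{} cp {} "$arc"/
ls "$arc"
```

Note that xargs is needed because the HSM command takes file arguments rather than reading pathnames from stdin; the whole policy still fits in one pipeline.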

Advantages

  • Very simple policy implementations
  • General-purpose functionality for uses beyond policy engines
  • Save hardware and software related to tracking new files (changelog ingest) and maintaining external DBs (Robinhood)

Drawbacks

  • Slower IOPS on the MDS due to transactional index updates
  • 'find' load on the MDS impacts other jobs
  • Name-based searches (e.g. find *.jpg) cannot be indexed and will be slow
  • Each index could be as large as the OI index (~10GB?)