Subsystem Map

Note: This page originated on the old Lustre wiki. It was identified as likely having value and was migrated to the new wiki. It is in the process of being reviewed/updated and may currently have content that is out of date.

(Updated: Nov 2008)

NOTE: An updated subsystem map, with links to Doxygen-generated API documentation and other documentation, is available in the Lustre Internals Documentation.


The Lustre™ subsystems are listed below. For each subsystem, a summary description and the location of its code are provided.

libcfs

Summary

Libcfs provides an API comprising fundamental primitives and subsystems (e.g. process management and debugging support) that are used throughout LNET, Lustre, and associated utilities. This API defines a portable runtime environment that is implemented consistently on all supported build targets.

Code

lustre/lnet/libcfs/**/*.[ch]

lnet

Summary

See the Lustre Networking white paper for details.

Code

lustre/lnet/**/*.[ch]

ptlrpc

Summary

Ptlrpc implements Lustre communications over LNET.

All communication between Lustre processes is handled by RPCs, in which a request is sent to an advertised service, and the service processes the request and returns a reply. Note that a service may be offered by any Lustre process: for example, the OST service on an OSS processes I/O requests, and the AST service on a client processes notifications of lock conflicts.

The initial request message of an RPC is special - it is received into the first available request buffer at the destination. All other communications involved in an RPC are like RDMAs - the peer targets them specifically. For example, in a bulk read, the OSC posts reply and bulk buffers and sends descriptors for them (the LNET matchbits used to post them) in the RPC request. After the server has received the request, it GETs or PUTs the bulk data and PUTs the RPC reply directly.

Ptlrpc ensures all resources involved in an RPC are freed in finite time. If the RPC does not complete within a timeout, all buffers associated with the RPC must be unlinked. These buffers are still accessible to the network until their completion events have been delivered.
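
The buffer handling described above can be modelled with a small sketch. This is not LNET or ptlrpc code: the Portal class, its method names, and the matchbits values 0x10 and 0x11 are invented for illustration, and the model unlinks a buffer immediately instead of waiting for a completion event to be delivered.

```python
class Portal:
    """Toy model of one node's receive side; names are illustrative, not LNET APIs."""

    def __init__(self, n_request_buffers):
        self.request_buffers = [None] * n_request_buffers  # shared request buffers
        self.posted = {}  # matchbits -> buffer posted for one specific RPC

    def receive_request(self, msg):
        # An initial request lands in the first available request buffer.
        for i, slot in enumerate(self.request_buffers):
            if slot is None:
                self.request_buffers[i] = msg
                return i
        raise BufferError("no free request buffer")

    def post(self, matchbits):
        # Reply and bulk buffers are posted under matchbits chosen by the sender.
        self.posted[matchbits] = None

    def put(self, matchbits, data):
        # The peer targets one posted buffer specifically, RDMA style.
        if matchbits not in self.posted:
            return False  # buffer was unlinked, e.g. after a timeout
        self.posted[matchbits] = data
        return True

    def unlink(self, matchbits):
        # On timeout, every buffer tied to the RPC must be unlinked.
        return self.posted.pop(matchbits, None)


client, server = Portal(1), Portal(2)
client.post(0x10)  # reply buffer
client.post(0x11)  # bulk buffer
slot = server.receive_request({"op": "read", "reply_mb": 0x10, "bulk_mb": 0x11})
request = server.request_buffers[slot]
bulk_ok = client.put(request["bulk_mb"], b"file data")  # server PUTs bulk data
reply_ok = client.put(request["reply_mb"], "OK")        # server PUTs the reply
```

Once a buffer has been unlinked, a late PUT aimed at its matchbits no longer finds a target, which is the property the timeout handling relies on.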

Code

lustre/ptlrpc/*.[ch]
lustre/ldlm/ldlm_lib.c

llog

Summary

Overview

LLog is the generic logging mechanism in Lustre. It allows Lustre to store records in an appropriate format and access them later using a reasonable API.

LLog is used in various cases. The main LLog use cases are the following:

  • mountconf - entire cluster configuration is stored on the MGS in a special configuration llog. A client may access it via an llog API working over ptlrpc;
  • MDS_OST llog - contains records for unlink and setattr operations performed on the MDS in the last uncommitted transaction. This is needed to preserve consistency between MDS and OST nodes in failure cases. The general rule is that if the MDS does not have an inode for a file, then the OST should not have an object for that file either. If the OST fails in the middle of an unlink and loses the last transaction, which contained the unlink of the OST object, the object may be left behind on the OST. On the MDS, the transaction that unlinked the file has finished and the MDS has no inode for the file. This means that the object can never be accessed again and just consumes space on the OST. The solution is to maintain the unlink log on the MDS and process it at MDS-OST connect time to make sure the OST has unlinked all such objects;
  • Size llog - this is not yet used, but is planned to log object size changes on the OST so the MDS can later check whether its object sizes are coherent with the OST (SOM case);
  • LOVEA llog - joins the file LOV EA merge log.
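
The MDS_OST case can be illustrated with a minimal sketch. It assumes a log that is simply a list of object ids; the UnlinkLog class, replay_on_connect function, and the object ids are hypothetical and do not reflect the real llog record format or API.

```python
class UnlinkLog:
    """Hypothetical stand-in for the MDS_OST llog kept on the MDS (the ORIG side)."""

    def __init__(self):
        self.records = []  # object ids whose unlink is not yet committed on the OST

    def add(self, object_id):
        self.records.append(object_id)

    def cancel(self, object_id):
        # Called once the OST reports the destroy as committed.
        self.records.remove(object_id)


def replay_on_connect(log, ost_objects):
    """At MDS-OST connect time, destroy every logged object the OST still holds."""
    destroyed = []
    for object_id in list(log.records):
        if object_id in ost_objects:
            ost_objects.discard(object_id)
            destroyed.append(object_id)
        log.cancel(object_id)
    return destroyed


log = UnlinkLog()
ost_objects = {101, 102, 103}

# The MDS unlinks the files backed by objects 101 and 102 and logs both.
log.add(101)
log.add(102)
# The OST commits the destroy of 101, then crashes before committing 102.
ost_objects.discard(101)
log.cancel(101)

# After the OST restarts and reconnects, the MDS replays what is left in the log.
orphans = replay_on_connect(log, ost_objects)
```

Object 102 would otherwise stay on the OST with no inode referring to it; replaying the log removes it and leaves the log empty.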

General design

Each llog type has two main parts:

  • ORIG llog - "server" part, the site where llog records are stored. It provides an API for local and/or network llog access (read, modify). Examples of ORIG logs: MDS is orig for MDS_OST llog and MGS is orig for config logs;
  • REPL llog - "client" part, the site where llog records may be used. Examples of REPL logs: OST is repl for MDS_OST llog and MGC is repl for config logs.

Code

obdclass/llog.c
obdclass/llog_cat.c
obdclass/llog_lvfs.c
obdclass/llog_obd.c
obdclass/llog_swab.c
obdclass/llog_test.c
lov/lov_log.c
ptlrpc/llog_client.c
ptlrpc/llog_server.c
ptlrpc/llog_net.c


For more information, see Logging API.

obdclass

Summary

The obdclass code provides generic Lustre configuration and device handling. Different functional parts of the Lustre code are split into obd devices, which can be configured and connected in various ways to form a server or client filesystem.

Several examples of obd devices include:

  • OSC - object storage client (connects over network to OST)
  • OST - object storage target
  • LOV - logical object volume (aggregates multiple OSCs into a single virtual device)
  • MDC - meta data client (connects over network to MDT)
  • MDT - meta data target

The obdclass code provides services used by all Lustre devices for configuration, memory allocation, generic hashing, kernel interface routines, random number generation, etc.

Code

lustre/obdclass/class_hash.c        - scalable hash code for imports
lustre/obdclass/class_obd.c         - base device handling code
lustre/obdclass/debug.c             - helper routines for dumping data structs
lustre/obdclass/genops.c            - device allocation/configuration/connection
lustre/obdclass/linux-module.       - linux kernel module handling
lustre/obdclass/linux-obdo.c        - pack/unpack obdo and other IO structs
lustre/obdclass/linux-sysctl.c      - /proc/sys configuration parameters 
lustre/obdclass/lprocfs_status.c    - /proc/fs/lustre configuration/stats, helpers
lustre/obdclass/lustre_handles.c    - wire opaque pointer handlers
lustre/obdclass/lustre_peer.c       - peer target identification by UUID
lustre/obdclass/obd_config.c        - configuration file parsing
lustre/obdclass/obd_mount.c         - server filesystem mounting
lustre/obdclass/obdo.c              - more obdo handling helpers
lustre/obdclass/statfs_pack.c       - statfs helpers for wire pack/unpack
lustre/obdclass/uuid.c              - UUID pack/unpack
lustre/lvfs/lvfs_common.c           - kernel interface helpers
lustre/lvfs/lvfs_darwin.c           - darwin kernel helper routines
lustre/lvfs/lvfs_internal.h         - lvfs internal function prototypes
lustre/lvfs/lvfs_lib.c              - statistics
lustre/lvfs/lvfs_linux.c            - linux kernel helper routines
lustre/lvfs/lvfs_userfs.c           - userspace helper routines
lustre/lvfs/prng.c                  - long period pseudo-random number generator
lustre/lvfs/upcall_cache.c          - supplementary group upcall for MDS

luclass

Summary

luclass is a body of data-type definitions and functions that implement support for layered objects. A layered object is an entity in which every layer of the Lustre device stack (both data and metadata, on both the client and the server side) can maintain its own private state and modify the behavior of the compound object in a systematic way.

Specifically, data types are introduced that represent a device type (struct lu_device_type, a layer in the Lustre stack), a device (struct lu_device, a specific instance of the type), and an object (struct lu_object). The following lu_object functionality is implemented by generic code:

  • A compound object is uniquely identified by a FID and is stored in a hash table indexed by FID;
  • Objects are kept in an LRU list, and a method is provided to purge the least recently accessed objects in response to memory pressure;
  • Objects are reference counted and cached;
  • Every object has a list of layers (also known as slices), where devices can store their private state. Every slice also comes with a pointer to an operations vector, allowing a device to modify the object's behavior.
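
A rough sketch of the caching behavior listed above follows. It is a toy model, not the lu_object implementation: the ObjectCache class, its methods, and the dictionary-based object header are all illustrative assumptions, and FIDs are represented as plain tuples.

```python
from collections import OrderedDict


class ObjectCache:
    """Toy FID-indexed cache; class and method names are illustrative only."""

    def __init__(self):
        self.table = {}           # FID -> object header (the hash table)
        self.lru = OrderedDict()  # unreferenced FIDs, least recently used first

    def find(self, fid):
        obj = self.table.get(fid)
        if obj is None:
            obj = {"fid": fid, "refs": 0, "slices": []}
            self.table[fid] = obj
        self.lru.pop(fid, None)   # a referenced object is not purgeable
        obj["refs"] += 1
        return obj

    def put(self, obj):
        obj["refs"] -= 1
        if obj["refs"] == 0:
            self.lru[obj["fid"]] = obj  # stays cached until purged

    def purge(self, count):
        """Free up to count unreferenced objects, least recently used first."""
        freed = []
        while self.lru and len(freed) < count:
            fid, _ = self.lru.popitem(last=False)
            del self.table[fid]
            freed.append(fid)
        return freed


cache = ObjectCache()
a = cache.find((1, 1, 0))
b = cache.find((1, 2, 0))
c = cache.find((1, 3, 0))
cache.put(a)
cache.put(b)
freed = cache.purge(5)  # simulated memory pressure
```

Only the two released objects are purged; the object that is still referenced stays in the table and a later lookup returns the same instance.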

In addition to objects and devices, luclass includes lu_context, which is a way to efficiently allocate space, without consuming stack space.

luclass design is specified in the MD API DLD.

Code

include/lu_object.h
obdclass/lu_object.c

ldlm

Summary

The Lustre Distributed Lock Manager (LDLM) is the Lustre locking infrastructure; it handles locks between clients and servers and locks local to a node. Different kinds of locks are available with different properties. For historical reasons, ldlm also contains some of the generic connection service code (both server and client).

Code

interval_tree.c           - This is used by extent locks to maintain interval trees (bug 11300)
l_lock.c                  - Resource locking primitives.
ldlm_extent.c             - Extents locking code used for locking regions inside objects
ldlm_flock.c              - BSD and POSIX lock types
ldlm_inodebits.c          - Inodebits locks used for metadata locking
ldlm_lib.c                - Target and client connecting/reconnecting/recovery code.
                            Does not really belong to ldlm, but is historically placed 
                            there. Should be in ptlrpc instead.
ldlm_lock.c               - This source file mostly has functions dealing with struct
                            ldlm_lock
ldlm_lockd.c              - Functions that imply replying to incoming lock-related rpcs 
                            (that could be both on server (lock enq/cancel/...) and client 
                            (ast handling)).
ldlm_plain.c              - Plain locks, predecessor to inodebits locks; not widely used now.
ldlm_pool.c               - Pools of locks, related to dynamic lrus and freeing locks on demand.
ldlm_request.c            - Collection of functions that work with locks by handle, as opposed 
                            to the lock structures themselves.
ldlm_resource.c           - Functions operating on namespaces and lock resources.
include/lustre_dlm.h      - Important defines and declarations for ldlm.

fids

Summary

A FID is the unique object identifier in the cluster since Lustre 1.7. Its main properties are the following:

  • A FID is a unique object identifier that is never reused;
  • A FID is allocated by the client within a sequence granted by the server;
  • A FID is the basis for the ldlm resource used for issuing ldlm locks, because its uniqueness makes it well suited to this purpose;
  • A FID is the basis for building client-side inode numbers, because the server inode+generation can no longer be used: in CMD this combination is not unique;
  • A FID does not contain storage information such as an inode number or generation, and is therefore easy to migrate.

A FID consists of three fields:

  • f_seq - sequence number
  • f_oid - object identifier inside sequence
  • f_ver - object version
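
As a small illustration, the three fields can be modelled as an immutable record. Only the field names come from the list above; the Fid type, the lock_table dictionary, the enqueue helper, and the sample values are hypothetical and stand in for the real ldlm resource handling.

```python
from collections import namedtuple

# Field names follow the list above; everything else is illustrative.
Fid = namedtuple("Fid", ["f_seq", "f_oid", "f_ver"])

lock_table = {}  # FID -> granted lock modes (toy stand-in for ldlm resources)


def enqueue(fid, mode):
    """Record a lock on the resource named by this FID (hypothetical helper)."""
    lock_table.setdefault(fid, []).append(mode)
    return fid


a = Fid(f_seq=0x400, f_oid=1, f_ver=0)
b = Fid(f_seq=0x400, f_oid=2, f_ver=0)
enqueue(a, "CR")
enqueue(b, "CR")
enqueue(a, "CR")
```

Because the two FIDs differ only in f_oid, they share a sequence but still name two distinct lock resources.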

Code

fid/fid_request.c
fid/fid_lib.c
fld/*.[ch]

seq

Summary

Overview

Sequence management is a basic mechanism in the new MDS server that is related to managing FIDs.

A FID is a unique object identifier in Lustre starting from version 1.7. All FIDs are organized into sequences. A sequence is a group of FIDs. Sequences are granted (allocated) to clients by servers, and FIDs are allocated by clients within a granted sequence. All FIDs in one sequence live on the same MDS server and as such form one "migration unit" and one "indexing unit": the FLD (FIDs Location Database) indexes them all by their sequence and thus needs only one mapping entry for all FIDs in the sequence. See the section devoted to FIDs on this page for more information on the FLD service and FIDs.

A sequence has a limit on the number of FIDs that can be allocated in it. When this limit is reached, a new sequence is allocated. After a disconnect, the server allocates a new sequence to the client when it comes back, and the previously used sequence is abandoned even if it was not exhausted. Sequences are a valuable resource, but in the case of recovery, using a new sequence makes things easier and also allows FIDs and objects to be grouped by working session: new connection, new sequence.

Code description

Server side code is divided into two parts:

  • Sequence controller - allocates super-sequences, that is, sequences of sequences, to all servers in the cluster (currently only to MDSs, as only they are aware of the new FIDs). Usually the first MDS in the cluster is the sequence controller;
  • Sequence manager - allocates meta-sequences (smaller ranges of sequences inside a super-sequence) to all clients, using the super-sequence granted by the sequence controller. All MDSs in the cluster (all servers in the future) are sequence managers. The first MDS is simultaneously a sequence controller and a sequence manager.

Client side code allocates new sequences from the granted meta-sequence. When the meta-sequence is exhausted, a new one is allocated on the server and sent to the client.

The client code provides an API for working with both server-side parts, not only with the sequence manager: because all servers need to talk to the sequence controller, they also use the client API for this.

One important part of the client API is FID allocation. A new FID is allocated in the currently granted sequence until the sequence is exhausted.
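
The three allocation levels can be sketched as nested range allocators. This is a simplified model under stated assumptions, not the code in fid/: the RangeAllocator and SeqClient classes, the range boundaries, the grant widths, the FIDs-per-sequence limit, and the choice to start object ids at 1 are all made up for the example.

```python
class RangeAllocator:
    """Hands out consecutive sub-ranges of [start, end); sizes are made up."""

    def __init__(self, start, end, grant_width):
        self.next, self.end, self.grant_width = start, end, grant_width

    def grant(self):
        if self.next + self.grant_width > self.end:
            raise RuntimeError("range exhausted")
        start = self.next
        self.next += self.grant_width
        return start, start + self.grant_width


class SeqClient:
    def __init__(self, manager, fids_per_seq):
        self.manager = manager
        self.fids_per_seq = fids_per_seq
        self.meta = iter(())  # sequences left in the granted meta-sequence
        self.seq = None
        self.next_oid = 0

    def _new_sequence(self):
        seq = next(self.meta, None)
        if seq is None:  # meta-sequence exhausted: ask the manager for another
            self.meta = iter(range(*self.manager.grant()))
            seq = next(self.meta)
        self.seq, self.next_oid = seq, 1

    def alloc_fid(self):
        if self.seq is None or self.next_oid > self.fids_per_seq:
            self._new_sequence()
        fid = (self.seq, self.next_oid, 0)  # (f_seq, f_oid, f_ver)
        self.next_oid += 1
        return fid


# Controller grants a super-sequence to one MDS acting as sequence manager.
controller = RangeAllocator(start=0x100, end=0x200, grant_width=0x40)
manager = RangeAllocator(*controller.grant(), grant_width=2)
client = SeqClient(manager, fids_per_seq=3)
fids = [client.alloc_fid() for _ in range(7)]
```

With three FIDs per sequence and two sequences per meta-sequence, the seventh allocation forces the client to request a second meta-sequence from the manager.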

Code

fid/fid_handler.c    - server side sequence management code;
fid/fid_request.c    - client side sequence management code;
fid/fid_lib.c        - fids related miscellaneous stuff.

mountconf

Summary

MountConf is how servers and clients are set up, started, and configured. A MountConf usage document is here.

The major subsystems are the MGS, MGC, and the userspace tools mount.lustre and mkfs.lustre.

The basic idea is:

  1. Whenever any Lustre component is mount(2)ed, we start an MGC.
  2. This establishes a connection to the MGS and downloads a configuration llog.
  3. The MGC passes the configuration log through the parser to set up the other OBDs.
  4. The MGC holds a CR configuration lock, which the MGS recalls whenever a live configuration change is made.
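
The steps above can be sketched as a tiny log parser. The record layout, the "attach" and "setup" command strings, the device names, and the idea of resuming from the last processed index after a lock recall are assumptions made for this example; they are not taken from the real configuration llog format.

```python
def process_config_log(records, devices, start=0):
    """Apply records[start:] to the device table; return the next index to read."""
    for cmd, name, arg in records[start:]:
        if cmd == "attach":
            devices[name] = {"type": arg, "setup": False}
        elif cmd == "setup":
            devices[name]["setup"] = True
            devices[name]["target"] = arg
        else:
            raise ValueError("unknown config command: " + cmd)
    return len(records)


# Hypothetical configuration log held by the MGS.
mgs_log = [
    ("attach", "fs-OST0000-osc", "osc"),
    ("setup", "fs-OST0000-osc", "ost0@tcp"),
]
devices = {}
next_index = process_config_log(mgs_log, devices)  # steps 2 and 3

# A live change: the MGS appends records and recalls the configuration lock,
# so the MGC (in this sketch) re-reads the log from where it left off.
mgs_log += [
    ("attach", "fs-OST0001-osc", "osc"),
    ("setup", "fs-OST0001-osc", "ost1@tcp"),
]
next_index = process_config_log(mgs_log, devices, start=next_index)
```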

Code

MountConf file areas:

lustre/mgs/*
lustre/mgc/*
lustre/obdclass/obd_mount.c
lustre/utils/mount_lustre.c
lustre/utils/mkfs_lustre.c

liblustre

Summary

Liblustre is a userspace library, used along with libsysio (developed by Sandia), that allows applications to use Lustre just by linking with it (or by loading it with LD_PRELOAD). Liblustre does not require any kernel support. It is also used on old Cray XT3 machines (and not so old, in the case of Sandia), where all applications are just linked with the library and loaded into memory as the only code to run. Liblustre does not support asynchronous operations of any kind, due to a lack of interrupts and other notifiers from lower levels to Lustre. Liblustre includes another set of LNDs that are able to work from userspace.

Code

dir.c          - Directory operations
file.c         - File handling operations (like open)
llite_lib.c    - General support (init/cleanup/parse options)
lutil.c        - Supplementary code to get IP addresses and init various structures 
                 needed to emulate the normal Linux process from other layers' perspective.
namei.c        - Metadata operations code.
rw.c           - I/O code, including read/write
super.c        - "Superblock" operation - mounting/umounting, inode operations.
tests          - directory with liblustre-specific tests.

echo client/server

Summary

The echo_client and obdecho are OBD devices that help with testing and performance measurement.

They were implemented originally for network testing - obdecho can replace obdfilter and echo_client can exercise any downstream configurations. They are normally used in the following configurations:

  • echo_client -> obdfilter. This is used to measure raw backend performance without any network I/O.
  • echo_client -> OSC -> <network> -> OST -> obdecho. This is used to measure network and ptlrpc performance.
  • echo_client -> OSC -> <network> -> OST -> obdfilter. This is used to measure performance available to the Lustre client.

Code

lustre/obdecho/

client vfs

Summary

The client VFS interface, also called llite, is the bridge between the Linux kernel and the underlying Lustre infrastructure represented by the LOV, MDC, and LDLM subsystems. This includes mounting the client filesystem, handling name lookups, starting file I/O, and handling file permissions.

The Linux VFS interface shares a lot in common with the liblustre interface, which is used in the Catamount environment; as of yet, the code for these two subsystems is not common and contains a lot of duplication.

Code

lustre/llite/dcache.c            - Interface with Linux dentry cache/intents
lustre/llite/dir.c               - readdir handling, filetype in dir, dir ioctl
lustre/llite/file.c              - File handles, file ioctl, DLM extent locks
lustre/llite/llite_close.c       - File close for opencache
lustre/llite/llite_internal.h    - Llite internal function prototypes, structures
lustre/llite/llite_lib.c         - Majority of request handling, client mount
lustre/llite/llite_mmap.c        - Memory-mapped I/O
lustre/llite/llite_nfs.c         - NFS export from clients
lustre/llite/lloop.c             - Loop-like block device export from object
lustre/llite/lproc_llite.c       - /proc interface for tunables, statistics
lustre/llite/namei.c             - Filename lookup, intent handling
lustre/llite/rw24.c              - Linux 2.4 IO handling routines
lustre/llite/rw26.c              - Linux 2.6 IO handling routines
lustre/llite/rw.c                - Linux generic IO handling routines
lustre/llite/statahead.c         - Directory statahead for "ls -l" and "rm -r"
lustre/llite/super25.c           - Linux 2.6 VFS file method registration
lustre/llite/super.c             - Linux 2.4 VFS file method registration
lustre/llite/symlink.c           - Symbolic links
lustre/llite/xattr.c             - User-extended attributes

client vm

Summary

Client code interacts with VM/MM subsystems of the host OS kernel to cache data (in the form of pages), and to react to various memory-related events, like memory pressure.

Two key components of this interaction are:

  • cfs_page_t data-type representing MM page. It comes together with the interface to map/unmap page to/from kernel virtual address space, access various per-page bits, like 'dirty', 'uptodate', etc., lock/unlock page. Currently, this data-type closely matches the Linux kernel page. It has to be straightened out, formalized, and expanded to include functionality like querying about total number of pages on a node, etc.
  • MM page operations in cl_page (part of new client I/O interface).

Code

This describes the next generation Lustre client I/O code, which is expected to appear in Lustre 2.0. Code location is not finalized.

cfs_page_t interface is defined and implemented in:

lnet/include/libcfs/ARCH/ARCH-mem.h
lnet/libcfs/ARCH/ARCH-mem.c 

Generic part of cl-page will be located in:

include/cl_object.h
obdclass/cl_page.c
obdclass/cl_object.c 

Linux kernel implementation is currently in:

llite/llite_cl.c

client I/O

Summary

Client I/O is a group of interfaces used by various layers of a Lustre client to manage file data (as opposed to metadata). Main functions of these interfaces are:

  • Cache data, respecting limitations imposed both by hosting MM/VM, and by cluster-wide caching policies, and
  • Form a stream of efficient I/O RPCs, respecting both ordering/timing constraints imposed by the hosting VFS (e.g., POSIX guarantees, O_SYNC, etc.), and cluster-wide IO scheduling policies.

Client I/O subsystem interacts with VFS, VM/MM, DLM, and PTLRPC.

Client I/O interfaces are based on the following data-types:

  • cl_object: represents a file system object, both a file, and a stripe;
  • cl_page: represents a cached data page;
  • cl_lock: represents an extent DLM lock;
  • cl_io: represents an ongoing high-level IO activity, like read(2)/write(2) system call, or sub-io of another IO;
  • cl_req: represents a network RPC.

Code

This describes the next generation Lustre client I/O code. The code location is not finalized. The generic part is at:

include/cl_object.h
obdclass/cl_object.c
obdclass/cl_page.c
obdclass/cl_lock.c
obdclass/cl_io.c 

Layer-specific methods are currently at:

lustre/LAYER/LAYER_cl.c 

where LAYER is one of llite, lov, osc.

client metadata

Summary

The Metadata Client (MDC) is the client-side interface for all operations related to the Metadata Server (MDS). In current configurations there is a single MDC on the client for each filesystem mounted on the client. The MDC is responsible for enqueueing metadata locks (via LDLM), and packing and unpacking messages on the wire.

In order to ensure a recoverable system, the MDC is limited at the client to only a single filesystem-modifying operation in flight at one time. This includes operations like create, rename, link, unlink, setattr.

For non-modifying operations like getattr and statfs, the client can have multiple RPC requests in flight at one time, limited by a tunable on the client, to avoid overwhelming the MDS.
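
The two in-flight limits can be modelled with a pair of semaphores. This is only a sketch of the policy described above: the MdcLimiter class and the max_getattr_in_flight parameter are hypothetical names, and the sketch reports whether a request could start instead of blocking as a real client would.

```python
import threading

MODIFYING_OPS = {"create", "rename", "link", "unlink", "setattr"}


class MdcLimiter:
    """Toy in-flight limiter; max_getattr_in_flight is a hypothetical tunable."""

    def __init__(self, max_getattr_in_flight):
        self.modify_slot = threading.Semaphore(1)  # one modifying RPC at a time
        self.read_slots = threading.Semaphore(max_getattr_in_flight)

    def _slot_for(self, op):
        return self.modify_slot if op in MODIFYING_OPS else self.read_slots

    def try_start(self, op):
        # A real client would wait here; the sketch just reports the outcome.
        return self._slot_for(op).acquire(blocking=False)

    def finish(self, op):
        self._slot_for(op).release()


mdc = MdcLimiter(max_getattr_in_flight=2)
results = [
    mdc.try_start("create"),   # takes the single modifying slot
    mdc.try_start("unlink"),   # refused: a modifying RPC is already in flight
    mdc.try_start("getattr"),
    mdc.try_start("statfs"),
    mdc.try_start("getattr"),  # refused: the non-modifying limit is reached
]
```

Once the create completes and releases its slot, the unlink can be sent.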

Code

lustre/mdc/lproc_mdc.c       - /proc interface for stats/tuning
lustre/mdc/mdc_internal.h    - Internal header for prototypes/structs
lustre/mdc/mdc_lib.c         - Packing of requests to MDS
lustre/mdc/mdc_locks.c       - Interface to LDLM and client VFS intents
lustre/mdc/mdc_reint.c       - Modifying requests to MDS
lustre/mdc/mdc_request.c     - Non-modifying requests to MDS

client lmv

Summary

LMV is a module that implements the CMD client-side abstraction device. It allows a client to work with many MDSs without any changes in the Llite module, and even without Llite knowing that CMD is supported. Llite just translates Linux VFS requests into metadata API calls and forwards them down the stack.

Because LMV needs to know which MDS to talk to for any particular operation, it uses some new services introduced with CMD3. They are:

  • FLD (FIDs Location Database) - given a FID, or rather its sequence, looks up the number of the MDS where this FID is located;
  • SEQ (Client Sequence Manager) - LMV uses this via children MDCs for allocating new sequences and FIDs.
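
The FLD lookup can be pictured as a table keyed by sequence. The sketch below is illustrative only: the fld dictionary contents, the fld_lookup and lmv_route helpers, and the MDC names are invented, and FIDs are written as (f_seq, f_oid, f_ver) tuples.

```python
# Hypothetical FLD contents: one entry per sequence, giving the MDS number.
fld = {0x100: 0, 0x101: 0, 0x140: 1}


def fld_lookup(fid):
    """Return the MDS number for a FID; only its sequence is consulted."""
    f_seq, _f_oid, _f_ver = fid
    return fld[f_seq]


def lmv_route(fid, mdcs):
    """Pick the child MDC that talks to the MDS holding this FID."""
    return mdcs[fld_lookup(fid)]


mdcs = ["mdc-for-MDS0", "mdc-for-MDS1"]
target = lmv_route((0x140, 7, 0), mdcs)
```

Since every FID in a sequence maps to the same MDS, the table needs one entry per sequence rather than one per object.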

LMV supports split objects. This means that for every split directory it creates a special in-memory structure that contains information about the object's stripes, including MDS number, FID, etc. All subsequent operations use these structures to determine which MDS should be used for a particular action (create, take lock, etc).

Code

lmv/*.[ch]

lov

Summary

The LOV device presents a single virtual device interface to upper layers (llite, liblustre, MDS). The LOV code is responsible for splitting of requests to the correct OSTs based on striping information (lsm), and the merging of the replies to a single result to pass back to the higher layer.

It calculates per-object membership and offsets for read/write/truncate based on the virtual file offset passed from the upper layer. It is also responsible for splitting the locking across all servers as needed.
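
The offset calculation can be illustrated with the conventional round-robin (RAID-0 style) mapping. This sketch assumes a fixed stripe size and stripe count; the function names are hypothetical and the code is not claimed to match lov_offset.c in detail.

```python
def stripe_map(file_offset, stripe_size, stripe_count):
    """Map a file offset to (stripe index, offset inside that stripe's object)."""
    stripe_number = file_offset // stripe_size   # which stripe-sized chunk
    stripe_index = stripe_number % stripe_count  # which object holds it
    rounds = stripe_number // stripe_count       # full passes over all objects
    object_offset = rounds * stripe_size + file_offset % stripe_size
    return stripe_index, object_offset


def split_extent(start, length, stripe_size, stripe_count):
    """Split a file extent into per-object (stripe index, offset, length) pieces."""
    pieces = []
    end = start + length
    while start < end:
        index, object_offset = stripe_map(start, stripe_size, stripe_count)
        chunk = min(end - start, stripe_size - start % stripe_size)
        pieces.append((index, object_offset, chunk))
        start += chunk
    return pieces


MIB = 1 << 20
first = stripe_map(0, MIB, 3)
wrapped = stripe_map(3 * MIB + 5, MIB, 3)
pieces = split_extent(MIB - 10, 30, MIB, 3)  # a write crossing a stripe boundary
```

The 30-byte extent that straddles the first stripe boundary becomes two requests, one for each of the first two objects, which is the kind of splitting the LOV performs before merging the replies.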

The LOV on the MDS is also involved in object allocation.

Code

lustre/lov/lov_ea.c          - Striping attributes pack/unpack/verify
lustre/lov/lov_internal.h    - Header for internal function prototypes/structs
lustre/lov/lov_merge.c       - Struct aggregation from many objects
lustre/lov/lov_obd.c         - Base LOV device configuration
lustre/lov/lov_offset.c      - File offset and object calculations
lustre/lov/lov_pack.c        - Pack/unpack of striping attributes
lustre/lov/lov_qos.c         - Object allocation for different OST loading
lustre/lov/lov_request.c     - Request handling/splitting/merging
lustre/lov/lproc_lov.c       - /proc/fs/lustre/lov tunables/statistics

quota

Summary

Quotas allow a system administrator to limit the maximum amount of disk space a user or group can consume. Quotas are set by root, and can be specified for individual users and/or groups. Quota limits can be set on both blocks and inodes.

Lustre quota enforcement differs from standard Linux quota support in several ways:

  • Lustre quotas are administered via the lfs command, whereas standard Linux quota uses the quotactl interface.
  • As Lustre is a distributed filesystem, Lustre quotas are also distributed in order to limit the impact on performance.
  • Quotas are allocated and consumed in a quantized fashion.
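
Quantized allocation can be sketched as a consumer that acquires quota from a central holder in fixed-size units and then spends it locally. The QuotaMaster and QuotaConsumer classes, the qunit value, and the limit are assumptions made for illustration and do not describe the actual quota protocol.

```python
class QuotaMaster:
    """Holds the global limit and hands it out in units of qunit (made-up sizes)."""

    def __init__(self, limit, qunit):
        self.remaining, self.qunit = limit, qunit

    def acquire(self):
        grant = min(self.qunit, self.remaining)
        self.remaining -= grant
        return grant


class QuotaConsumer:
    """A server that spends locally held quota and refills it one unit at a time."""

    def __init__(self, master):
        self.master = master
        self.local = 0
        self.acquire_rpcs = 0

    def consume(self, amount):
        while self.local < amount:
            grant = self.master.acquire()
            self.acquire_rpcs += 1
            if grant == 0:
                return False  # over quota
            self.local += grant
        self.local -= amount
        return True


master = QuotaMaster(limit=100, qunit=40)
consumer = QuotaConsumer(master)
writes = [consumer.consume(10) for _ in range(4)]
```

Four writes are satisfied with a single acquire request, which is how quantization limits the performance impact of enforcing a cluster-wide limit.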

Code

Quota core:

lustre/quota/quota_adjust_qunit.c
lustre/quota/quota_check.c
lustre/quota/quotacheck_test.c
lustre/quota/quota_context.c
lustre/quota/quota_ctl.c
lustre/quota/quota_interface.c
lustre/quota/quota_internal.h
lustre/quota/quota_master.c 

Interactions with the underlying ldiskfs filesystem:

lustre/lvfs/fsfilt_ext3.c
lustre/lvfs/lustre_quota_fmt.c
lustre/lvfs/lustre_quota_fmt_convert.c 

Hooks under:

lustre/mds
lustre/obdfilter 

Regression tests:

lustre/tests/sanity-quota.sh

security-gss

Summary

Secure ptlrpc (sptlrpc) is a framework inside the ptlrpc layer. It acts on both sides of each ptlrpc connection between two nodes, transforming every RPC message to turn the connection into a secure communication link. By using GSS, sptlrpc is able to support multiple authentication mechanisms, but currently only Kerberos 5 is supported.

Supported security flavors:

  • null: no authentication, no data transform, thus no performance overhead; compatible with 1.6;
  • plain: no authentication, simple data transform, minimal performance overhead;
  • krb5x: per-user basis client-server mutual authentication using Kerberos 5, sign or encrypt data, could have substantial CPU overhead.

Code

lustre/ptlrpc/sec*.c
lustre/ptlrpc/gss/
lustre/utils/gss/

security-capa

Summary

Capabilities are pieces of data generated by one service (the master service), passed to a client, and presented by the client to another service (the slave service) to authorize an action. Capability-based authorization is independent of the R/W/X permission-based authorization of file operations.
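
One common way to build such a token is a keyed hash over the permitted action. The sketch below assumes that approach purely for illustration: the source does not say how Lustre capabilities are signed, and the shared key, the token layout, and both function names are hypothetical.

```python
import hashlib
import hmac

SHARED_KEY = b"demo-key"  # hypothetical key known to both services


def issue_capability(object_id, allowed_ops, key=SHARED_KEY):
    """Master service: sign a statement of what the client may do."""
    body = ("%s:%s" % (object_id, ",".join(sorted(allowed_ops)))).encode()
    signature = hmac.new(key, body, hashlib.sha256).hexdigest()
    return body, signature


def check_capability(capability, object_id, op, key=SHARED_KEY):
    """Slave service: verify the signature, then check the requested action."""
    body, signature = capability
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    cap_object, cap_ops = body.decode().split(":")
    return cap_object == str(object_id) and op in cap_ops.split(",")


cap = issue_capability(42, {"read"})
forged = (b"42:read,write", cap[1])  # client tries to widen its own rights
```

The slave service never contacts the master service for the check; it only needs the key, so a tampered token fails verification locally.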

Code

lustre/llite/llite_capa.c
lustre/mdt/mdt_capa.c
lustre/obdfilter/filter_capa.c
lustre/obdclass/capa.c
lustre/include/lustre_capa.h

security-identity

Summary

Lustre identity is a miscellaneous framework for Lustre file operation authorization. Generally, it can be divided into two parts:

  • User-identity parse / upcall / mapping.
  • File operation permission maintenance and checking, including the traditional file-mode-based permissions and ACL-based permissions.

Code

/llite/llite_rmtacl.c
lustre/mdt/mdt_identity.c
lustre/mdt/mdt_idmap.c
lustre/mdt/mdt_lib.c
lustre/obdclass/idmap.c
lustre/utils/l_getidentity.c
lustre/include/lustre_idmap.h 
lustre/llite/xattr.c
lustre/mdt/mdt_xattr.c
lustre/cmm/cmm_object.c
lustre/cmm/mdc_object.c
lustre/mdd/mdd_permission.c
lustre/mdd/mdd_object.c
lustre/mdd/mdd_dir.c
lustre/obdclass/acl.c
lustre/include/lustre_eacl.h

OST

Summary

The OST is a very thin layer of the data server. Its main responsibility is to translate RPCs into local calls to obdfilter, i.e. RPC parsing.

Code

lustre/ost/*.[ch]

ldiskfs

Summary

ldiskfs is a local disk filesystem built on top of ext3. It adds extents support, a multiblock allocator, multimount protection and iopen features to ext3.

Code

There is no ldiskfs source code in the Lustre repositories (only patches). Instead, ext3 code is copied from your build kernel, the patches are applied, and then the whole thing is renamed to ldiskfs. For details, go to ldiskfs/.

fsfilt

Summary

The fsfilt layer abstracts the backing filesystem specifics away from the obdfilter and mds code in Lustre 1.4 and 1.6. This avoids linking the obdfilter and mds directly against the filesystem module, and in theory allows different backing filesystems, but in practice this was never implemented. In Lustre 1.8 and later this code is replaced by the OSD layer.

There is a core fsfilt module which can auto-load the backing filesystem type based on the type specified during configuration. This loads a filesystem-specific fsfilt_{fstype} module with a set of methods for that filesystem.

There are a number of different kinds of methods:

  • Get/set filesystem label and UUID for identifying the backing filesystem
  • Start, extend, commit compound filesystem transactions to allow multi-file updates to be atomic for recovery
  • Set a journal callback for transaction disk commit (for Lustre recovery)
  • Store attributes in the inode (possibly avoiding side-effects like truncation when setting the inode size to zero)
  • Get/set file attributes (EAs) for storing LOV and OST info (e.g. striping)
  • Perform low-level IO on the file (avoiding cache)
  • Get/set file version (for future recovery mechanisms)
  • Access quota information
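
The autoloading of a per-filesystem method table can be sketched as a registry keyed by filesystem type. Everything here is a simplified assumption: the registry, the load_module stand-in for kernel module loading, the two label methods, and the dictionary used as a fake filesystem are illustrative and not the fsfilt interface itself.

```python
FSFILT_TYPES = {}  # fstype -> method table, filled as "modules" load


def register_fsfilt(fstype, ops):
    FSFILT_TYPES[fstype] = ops


def load_module(name):
    # Stand-in for kernel module autoloading; only ldiskfs exists in this sketch.
    if name == "fsfilt_ldiskfs":
        register_fsfilt("ldiskfs", {
            "get_label": lambda fs: fs["label"],
            "set_label": lambda fs, label: fs.update(label=label),
        })


def get_ops(fstype):
    """Return the method table for fstype, loading its module on first use."""
    if fstype not in FSFILT_TYPES:
        load_module("fsfilt_" + fstype)
    if fstype not in FSFILT_TYPES:
        raise LookupError("no fsfilt support for " + fstype)
    return FSFILT_TYPES[fstype]


ops = get_ops("ldiskfs")
backing_fs = {"label": ""}
ops["set_label"](backing_fs, "lustre-OST0000")
label = ops["get_label"](backing_fs)
```

Callers such as obdfilter only see the method table, so they never need to link against the backing filesystem directly.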

Code

The files used for the fsfilt code reside in:

lustre/lvfs/fsfilt.c         - Interface used by obdfilter/MDS, module autoloading
lustre/lvfs/fsfilt_ext3.c    - Interface to ext3/ldiskfs filesystem

The fsfilt_ldiskfs.c file is auto-generated from fsfilt_ext3.c in lustre/lvfs/autoMakefile.am using sed to replace instances of ext3 and EXT3 with ldiskfs, and a few other replacements to avoid symbol clashes.

ldiskfs OSD

Summary

ldiskfs-OSD is an implementation of the dt_{device,object} interfaces on top of a (modified) ldiskfs filesystem.

It uses standard ldiskfs/ext3 code to do file I/O.

It supports 2 types of indices (in the same file system):

  • iam-based index: this is an extension of ext3 htree directory format with support for more general keys and values, and with relaxed size restrictions, and
  • compatibility index: this is the usual ldiskfs directory, accessible through dt_index_operations.

ldiskfs-OSD uses a read-write mutex to serialize compound operations.

Code

lustre/include/dt_object.h
lustre/osd/osd_internal.h
lustre/osd/osd_handler.c

DMU OSD

Summary

This is another implementation of the OSD API for userspace DMU. It uses DMU's ZAP for indices.

Code

dmu-osd/*.[ch] in b_hd_dmu branch

DMU

Summary

The DMU is one of the layers in Sun's ZFS filesystem which is responsible for presenting a transactional object store to its consumers. It is used as Lustre's backend object storage mechanism for the userspace MDSs and OSSs.

The ZFS community page has a source tour which is useful as an introduction to the several ZFS layers: ZFS source

There are many useful resources in that community page.

For reference, here's a list of DMU features:

  • Atomic transactions
  • End-to-end data and metadata checksumming (currently supports fletcher2, fletcher4 and sha-256)
  • Compression (currently supports lzjb and gzip with compression levels 1..9)
  • Snapshots and clones
  • Variable block sizes (currently supports sector sizes from 512 bytes to 128KB)
  • Integrated volume management with support for RAID-1, RAID-Z and RAID-Z2 and striping
  • Metadata and optional data redundancy (ditto blocks) atop the inherent storage pool redundancy for high resilience
  • Self-healing, which works due to checksumming, ditto blocks and pool redundancy
  • Storage devices that act as level-2 caches (designed for flash storage)
  • Hot spares
  • Designed with scalability in mind - supports up to 2^64 bytes per object, 2^48 objects per filesystem, 2^64 filesystems per pool, 2^64 bytes per device, 2^64 devices per pool, ..
  • Very easy to use admin interface (zfs and zpool commands)

Code

src/                  - source code

src/cmd/              - ZFS/DMU related programs
src/cmd/lzfs/         - lzfs, the filesystem administration utility
src/cmd/lzpool/       - lzpool, the pool administration utility
src/cmd/lzdb/         - lzdb, the zfs debugger
src/cmd/lztest/       - lztest, the DMU test suite
src/cmd/lzfsd/        - lzfsd, the ZFS daemon

src/lib/              - Libraries
src/lib/port/         - Portability layer
src/lib/solcompat/    - Solaris -> Linux portability layer (deprecated, use libport instead)
src/lib/avl/          - AVL trees, used in many places in the DMU code
src/lib/nvpair/       - Name-value pairs, used in many places in the DMU code
src/lib/umem/         - Memory management library
src/lib/zpool/        - Main ZFS/DMU code
src/lib/zfs/          - ZFS library used by the lzfs and lzpool utilities
src/lib/zfscommon/    - Common ZFS code between libzpool and libzfs
src/lib/ctl/          - Userspace control/management interface
src/lib/udmu/         - Lustre uDMU code (thin library around the DMU)

src/scons/            - local copy of SCons

tests/regression/     - Regression tests

misc/                 - miscellaneous files/scripts

obdfilter

Summary

obdfilter is a core component of the OST (data server). It makes the underlying disk filesystem a part of the distributed system:

  • Maintains cluster-wide coherency for data
  • Maintains space reservation for data in client's cache (grants)
  • Maintains quota
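The idea behind grants can be sketched with a toy model: the server promises each client an amount of space, so data held in the client's cache is guaranteed to fit when it is eventually written. The class and method names below are hypothetical and do not correspond to obdfilter functions; the real accounting is considerably more involved.

```python
class GrantTable:
    """Toy model of per-client space grants on one OST (hypothetical names)."""

    def __init__(self, free_bytes: int):
        self.free_bytes = free_bytes  # space neither written nor promised
        self.granted = {}             # client id -> bytes promised to it

    def request_grant(self, client: str, wanted: int) -> int:
        """Promise up to `wanted` bytes, never more than is still free."""
        given = min(wanted, self.free_bytes)
        self.free_bytes -= given
        self.granted[client] = self.granted.get(client, 0) + given
        return given

    def write(self, client: str, nbytes: int) -> bool:
        """Flush cached data; it must fit within the client's grant."""
        if nbytes > self.granted.get(client, 0):
            return False              # client cached more than it was promised
        self.granted[client] -= nbytes  # promised space becomes used space
        return True


ost = GrantTable(free_bytes=100)
first = ost.request_grant("client-a", 80)
second = ost.request_grant("client-b", 80)  # only 20 bytes are left to promise
```

Because the sum of all grants never exceeds the free space, a write that stays within its grant cannot run out of space in this model.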

Code

lustre/obdfilter/*.[ch]

MDS

Summary

The MDS service in Lustre 1.4 and 1.6 is a monolithic body of code that provides multiple functions related to filesystem metadata. It handles the incoming RPCs and service threads for metadata operations (create, rename, unlink, readdir, etc), interfaces with the Lustre distributed lock manager (ldlm), and also manages the underlying filesystem (via the fsfilt interface).

The MDS is the primary point of access control for clients. It allocates the objects belonging to a file (in conjunction with LOV) and passes that information to the clients when they access a file.

The MDS is also ultimately responsible for deleting objects on the OSTs, either by passing object information for destroy to the client removing the last link or open reference on a file and having the client do it, or by destroying the objects on the OSTs itself in case the client fails to do so.

In the 1.8 and later releases, the functionality provided by the MDS code has been split into multiple parts (MDT, MDD, OSD) in order to allow stacking of the metadata devices for clustered metadata.

Code

lustre/mds/commit_confd.c
lustre/mds/handler.c            - RPC request handler
lustre/mds/lproc_mds.c          - /proc interface for stats/control
lustre/mds/mds_fs.c             - Mount/configuration of underlying filesystem
lustre/mds/mds_internal.h       - Header for internal declarations
lustre/mds/mds_join.c           - Handle join_file operations
lustre/mds/mds_lib.c            - Unpack of wire structs from requests
lustre/mds/mds_log.c            - Lustre log interface (llog) for unlink/setattr
lustre/mds/mds_lov.c            - Interface to LOV for create and orphan
lustre/mds/mds_open.c           - File open/close handling
lustre/mds/mds_reint.c          - Reintegration of changes made by clients
lustre/mds/mds_unlink_open.c    - Handling of open-unlinked files (PENDING dir)
lustre/mds/mds_xattr.c          - User-extended attribute handling

MDT

Summary

MDT stands for MetaData Target. It is the top-most layer in the MD server device stack. The MDT is responsible for all networking, as far as meta-data is concerned:

  • Managing PTLRPC services and threads;
  • Receiving incoming requests, unpacking them and checking their validity;
  • Sending replies;
  • Handling recovery;
  • Using DLM to guarantee cluster-wide meta-data consistency;
  • Handling intents;
  • Handling credential translation.

Theoretically, the MDT is an optional layer: a completely local Lustre setup, with a single meta-data server and a locally mounted client, can exist without the MDT (and still use networking for non-metadata access).

Code

lustre/mdt/mdt.mod.c
lustre/mdt/mdt_capa.c
lustre/mdt/mdt_handler.c
lustre/mdt/mdt_identity.c
lustre/mdt/mdt_idmap.c
lustre/mdt/mdt_internal.h
lustre/mdt/mdt_lib.c
lustre/mdt/mdt_lproc.c
lustre/mdt/mdt_open.c
lustre/mdt/mdt_recovery.c
lustre/mdt/mdt_reint.c
lustre/mdt/mdt_xattr.c

CMM

Summary

Overview

The CMM is a new layer in the MDS that handles all clustered metadata issues and relationships. The CMM does the following:

  • Acts as a layer between the MDT and MDD.
  • Provides MDS-MDS interaction.
  • Queries and updates FLD.
  • Does the local or remote operation if needed.
  • Will do rollback - epoch control, undo logging.

CMM functionality

CMM chooses all servers involved in an operation and sends dependent requests if needed. Calling a remote MDS is a new feature related to CMD. CMM maintains a list of MDCs to connect with all other MDSs.

Objects

The CMM can allocate two types of objects: local and remote. A remote object can occur during metadata operations that involve more than one object. Such an operation is called a cross-ref operation.
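A minimal sketch of the local/remote decision follows. It assumes that an object is identified by a (sequence, object id) pair and that a plain dictionary stands in for the FLD lookup, mapping a sequence to the index of the MDS that holds it. These names and data shapes are hypothetical and are not the CMM API.

```python
from typing import Dict, List, Tuple

Fid = Tuple[int, int]  # (sequence, object id); a simplified, hypothetical form


def classify_objects(fld: Dict[int, int], local_mds: int, fids: List[Fid]):
    """Split the objects of one operation into local and remote ones.

    `fld` stands in for the FLD: it maps a sequence to the index of the
    MDS that holds objects from that sequence.
    """
    local, remote = [], []
    for fid in fids:
        owner = fld[fid[0]]
        (local if owner == local_mds else remote).append(fid)
    return local, remote


fld = {0x100: 0, 0x200: 1}
# An operation on two objects: one held by MDS 0, the other by MDS 1.
local, remote = classify_objects(fld, local_mds=0, fids=[(0x100, 7), (0x200, 9)])
is_cross_ref = bool(remote)
```

An operation with a non-empty remote list is the cross-ref case, in which a request would have to be sent to another MDS.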

Code

lustre/cmm

MDD

Summary

MDD is the metadata layer in the new MDS stack, and the only layer that operates on metadata in the MDS. The implementation is similar to VFS metadata operations but is based on OSD storage. The MDD API is currently used only in the new MDS stack, where it is called by the CMM layer.

In theory, MDD should be a purely local metadata layer, but for compatibility with the old MDS stack and to reuse some MDS code (llog and lov), an mds device is created and connected to the mdd. The llog and lov code in mdd therefore still uses the original code through this temporary mds device, which will be removed when the new llog and lov layers in the new MDS stack are implemented.

Code

lustre/lustre/mdd/

recovery

Summary

Overview

Client recovery starts when no server reply is received within a given timeout, or when the server tells the client that it is not connected (the client was evicted on the server earlier for whatever reason).

Recovery consists of trying to connect to the server and then stepping through several recovery states during which client-server data is synchronized, namely all requests that were already sent to the server but not yet confirmed as received, and DLM locks. Should any problem arise during recovery (be it a timeout or the server's refusal to recognise the client again), recovery is restarted from the very beginning.

During recovery, new requests are not sent to the server but are added to a special delayed-requests queue, which is sent once recovery completes successfully.

Replay and Resend

  • Clients go through all the requests in the sending and replay lists and determine the recovery action needed: replay the request, resend the request, or clean up the associated state for committed requests.
  • The client replays requests that were not committed on the server, but for which the client saw a reply from the server before it failed. This allows the server to replay the changes to the persistent store.
  • The client resends requests that were committed on the server but for which the client did not see a reply, perhaps because a server or network failure caused the reply to be lost. This allows the server to reconstruct the reply and send it to the client.
  • The client resends requests that the server has not seen at all. These are all requests with a transid higher than the last_rcvd value from the server and the last_committed transno, and for which the reply-seen flag is not set.
  • The client gets the last_committed transno information from the server and cleans up the state associated with requests that were committed on the server.
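The decision described in the list above can be sketched as a small classification function. The Request type, its field names and the string results are hypothetical simplifications; the real client tracks considerably more state per request.

```python
from dataclasses import dataclass


@dataclass
class Request:
    transno: int      # transaction number from the server's reply (0 if none)
    reply_seen: bool  # did the client receive a reply before the failure?


def recovery_action(req: Request, last_committed: int) -> str:
    """Pick the recovery action for one request on the client's lists."""
    if not req.reply_seen:
        # Either the server never saw the request, or it executed it and
        # the reply was lost. In both cases the client resends.
        return "resend"
    if req.transno <= last_committed:
        return "cleanup"  # committed and acknowledged: drop the saved state
    return "replay"       # reply seen, but the change was not yet committed


last_committed = 10
pending = (Request(8, True), Request(12, True), Request(0, False))
actions = [recovery_action(r, last_committed) for r in pending]
```

With last_committed at 10, the request with transno 8 only needs its state cleaned up, the one with transno 12 is replayed, and the request without a reply is resent.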

Code

Recovery code is scattered throughout almost all of the code. The most important parts are:

ldlm/ldlm_lib.c    - generic server recovery code 
ptlrpc/            - client recovery code

version recovery

Summary

Version Based Recovery

This recovery technique is based on using versions of objects (inodes) to allow clients to recover later than the ordinary server recovery timeframe.

  1. The server changes the version of an object on every change and returns that data to the client. The version can be checked during replay to make sure that the object is in the same state during replay as it was originally.
  2. After a failure, the server starts recovery as usual, but if some client is missing, the version check is used for replays.
  3. A missed client can connect later and try to recover. This is 'delayed recovery', and the version check is always used during it.
  4. A client that missed the main recovery window is not evicted and can connect later to initiate recovery. In that case, the versions are checked to determine whether the object was changed by someone else.
  5. When replay is finished, the client and server check whether any replay failed on any request because of a version mismatch. If not, the client gets a successful reintegration message. If a version mismatch was encountered, the client must be evicted.
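The version check in the steps above can be sketched as follows. The dictionary of object versions, the function name and the exception are hypothetical; the sketch only shows the comparison between the version a client recorded and the version the server currently holds.

```python
class VersionMismatch(Exception):
    """Raised when an object changed since the client's original request."""


def replay_with_version_check(objects: dict, name: str,
                              pre_version: int, new_version: int) -> None:
    """Apply a replayed change only if the object is in the expected state.

    `objects` maps an object name to its current version; `pre_version` is
    the version the client saw before its original operation.
    """
    if objects[name] != pre_version:
        raise VersionMismatch(name)
    objects[name] = new_version


objects = {"inode-42": 5}
# The first replay matches the recorded version and is applied.
replay_with_version_check(objects, "inode-42", pre_version=5, new_version=6)

try:
    # A stale replay still expects version 5 and is rejected.
    replay_with_version_check(objects, "inode-42", pre_version=5, new_version=7)
    evicted = False
except VersionMismatch:
    evicted = True
```

In this model, a single mismatch is enough to mark the client for eviction, matching the last step of the list.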

Code

Recovery code is scattered throughout almost all of the code. The most important parts are:

ldlm/ldlm_lib.c    - generic server recovery code 
ptlrpc/            - client recovery code

IAM

Summary

IAM stands for 'Index Access Module': it is an extension to the ldiskfs directory code, adding generic indexing capability.

A file system directory can be thought of as an index mapping keys, which are strings (file names), to records, which are integers (inode numbers). IAM removes limitations on key and record size and format, providing an abstraction of a transactional container that maps arbitrary opaque keys to opaque records.

Implementation notes:

  • IAM is implemented as a set of patches to ldiskfs;
  • IAM is an extension of the ldiskfs directory code that uses the htree data structure for scalable indexing;
  • IAM uses fine-grained key-level and node-level locking (pdirops locking, designed and implemented by Alex Tomas);
  • IAM doesn't assume any internal format for keys. Keys are compared by the memcmp() function (which dictates BE order for scalars);
  • IAM supports different flavors of containers:
    • lfix: fixed size record and fixed size keys,
    • lvar: variable sized records and keys,
    • htree: compatibility mode, allowing normal htree directory to be accessed as an IAM container;
  • IAM comes with ioctl(2) based user-level interface.

IAM is used by ldiskfs-OSD to implement dt_index_operations interface.
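The note above that memcmp() comparison dictates BE order for scalars can be illustrated with a short sketch. It encodes the same integers as 4-byte keys in both byte orders and sorts them bytewise. The helper name is hypothetical, and Python's lexicographic bytes comparison stands in for memcmp() on equal-length buffers.

```python
import struct


def encode_key(n: int, big_endian: bool = True) -> bytes:
    """Encode an unsigned 32-bit integer as a 4-byte key."""
    return struct.pack(">I" if big_endian else "<I", n)


values = [1, 256, 2, 65536]

# bytes objects compare byte by byte from the left, which gives the same
# ordering as memcmp() for buffers of equal length.
by_be_key = sorted(values, key=lambda n: encode_key(n, big_endian=True))
by_le_key = sorted(values, key=lambda n: encode_key(n, big_endian=False))
```

Sorting by big-endian keys gives numeric order, because the most significant byte is compared first. Sorting by little-endian keys does not.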

Code

lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6-sles10.patch
lustre/ldiskfs/kernel_patches/patches/ext3-iam-ops.patch
lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.18-rhel5.patch
lustre/ldiskfs/kernel_patches/patches/ext3-iam-rhel4.patch
lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.18-vanilla.patch
lustre/ldiskfs/kernel_patches/patches/ext3-iam-separate.patch
lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.9-rhel4.patch
lustre/ldiskfs/kernel_patches/patches/ext3-iam-sles10.patch
lustre/ldiskfs/kernel_patches/patches/ext3-iam-common.patch
lustre/ldiskfs/kernel_patches/patches/ext3-iam-uapi.patch

SOM

Summary

Size-on-MDS (SOM) is a metadata improvement that caches the inode size, blocks, ctime and mtime on the MDS. This attribute caching lets clients avoid making RPCs to the OSTs to find the attributes encoded in the file objects kept on those OSTs, which significantly improves the performance of listing directories.

Code

llite/llite_close.c       - client-side SOM code
liblustre/file.c          - liblustre SOM code
mdt/mdt_handler.c         - general handling of SOM-related rpc
mdt/mdt_open.c            - MDS side SOM code 
mdt/mdt_recovery.c        - MDS side SOM recovery code
obdfilter/filter_log.c    - OST side IO epoch llogging code

tests

Summary

The "tests" subsystem is a set of scripts and programs used to test other Lustre subsystems. It contains:

runtests
Simple basic regression test

sanity
A set of regression tests that verify operation under normal operating conditions

fsx
file system exerciser

sanityn
Tests that verify operations from two clients under normal operating conditions

lfsck
Tests e2fsck and lfsck to detect and fix filesystem corruption

liblustre
Runs a test linked to a liblustre client library

replay-single
A set of unit tests that verify recovery after MDS failure

conf-sanity
A set of unit tests that verify the configuration

recovery-small
A set of unit tests that verify RPC replay after communications failure

replay-ost-single
A set of unit tests that verify recovery after OST failure

replay-dual
A set of unit tests that verify the recovery from two clients after server failure

insanity
A set of tests that verify the multiple concurrent failure conditions

sanity-quota
A set of tests that verify filesystem quotas

acceptance-small.sh is a wrapper that is normally used to run all (or any) of these scripts. In addition, it is used to run the following pre-installed benchmarks:

dbench
Dbench benchmark for simulating N clients to produce the filesystem load

bonnie
Bonnie++ benchmark for creation, reading, and deleting many small files

iozone
Iozone benchmark for generating and measuring a variety of file operations.

Code

lustre/tests/acl/run
lustre/tests/acl/make-tree
lustre/tests/acl/README
lustre/tests/acl/setfacl.test
lustre/tests/acl/getfacl-noacl.test
lustre/tests/acl/permissions.test
lustre/tests/acl/inheritance.test
lustre/tests/acl/misc.test
lustre/tests/acl/cp.test
lustre/tests/cfg/local.sh
lustre/tests/cfg/insanity-local.sh
lustre/tests/ll_sparseness_write.c
lustre/tests/writeme.c
lustre/tests/cobd.sh
lustre/tests/test_brw.c
lustre/tests/ll_getstripe_info.c
lustre/tests/lov-sanity.sh
lustre/tests/sleeptest.c
lustre/tests/flocks_test.c
lustre/tests/getdents.c
lustre/tests/ll_dirstripe_verify.c
lustre/tests/sanity.sh
lustre/tests/multifstat.c
lustre/tests/sanityN.sh
lustre/tests/liblustre_sanity_uml.sh
lustre/tests/fsx.c
lustre/tests/small_write.c
lustre/tests/socketserver
lustre/tests/cmknod.c
lustre/tests/README
lustre/tests/acceptance-metadata-double.sh
lustre/tests/writemany.c
lustre/tests/llecho.sh
lustre/tests/lfscktest.sh
lustre/tests/run-llog.sh
lustre/tests/conf-sanity.sh
lustre/tests/mmap_sanity.c
lustre/tests/write_disjoint.c
lustre/tests/ldaptest.c
lustre/tests/acceptance-metadata-single.sh
lustre/tests/compile.sh
lustre/tests/mcreate.c
lustre/tests/runas.c
lustre/tests/replay-single.sh
lustre/tests/lockorder.sh
lustre/tests/test2.c
lustre/tests/llog-test.sh
lustre/tests/fchdir_test.c
lustre/tests/mkdirdeep.c
lustre/tests/runtests
lustre/tests/flock.c
lustre/tests/mlink.c
lustre/tests/checkstat.c
lustre/tests/crash-mod.sh
lustre/tests/multiop.c
lustre/tests/random-reads.c
lustre/tests/disk1_4.zip
lustre/tests/rundbench
lustre/tests/wantedi.c
lustre/tests/rename_many.c
lustre/tests/leak_finder.pl
lustre/tests/Makefile.am
lustre/tests/parallel_grouplock.c
lustre/tests/chownmany.c
lustre/tests/ost_oos.sh
lustre/tests/mkdirmany.c
lustre/tests/directio.c
lustre/tests/insanity.sh
lustre/tests/createmany-mpi.c
lustre/tests/createmany.c
lustre/tests/runiozone
lustre/tests/rmdirmany.c
lustre/tests/replay-ost-single.sh
lustre/tests/mcr.sh
lustre/tests/mrename.c
lustre/tests/sanity-quota.sh
lustre/tests/lp_utils.c
lustre/tests/lp_utils.h
lustre/tests/acceptance-metadata-parallel.sh
lustre/tests/oos.sh
lustre/tests/createdestroy.c
lustre/tests/toexcl.c
lustre/tests/replay-dual.sh
lustre/tests/createtest.c
lustre/tests/munlink.c
lustre/tests/iopentest1.c
lustre/tests/iopentest2.c
lustre/tests/openme.c
lustre/tests/openclose.c
lustre/tests/test-framework.sh
lustre/tests/ll_sparseness_verify.c
lustre/tests/it_test.c
lustre/tests/unlinkmany.c
lustre/tests/opendirunlink.c
lustre/tests/filter_survey.sh
lustre/tests/utime.c
lustre/tests/openunlink.c
lustre/tests/runvmstat
lustre/tests/statmany.c
lustre/tests/create.pl
lustre/tests/oos2.sh
lustre/tests/statone.c
lustre/tests/rename.pl
lustre/tests/set_dates.sh
lustre/tests/openfilleddirunlink.c
lustre/tests/openfile.c
lustre/tests/llmountcleanup.sh
lustre/tests/llmount.sh
lustre/tests/acceptance-small.sh
lustre/tests/truncate.c
lustre/tests/recovery-small.sh
lustre/tests/2ost.sh
lustre/tests/tchmod.c
lustre/tests/socketclient
lustre/tests/runobdstat
lustre/tests/memhog.c
lustre/tests/flock_test.c
lustre/tests/busy.sh
lustre/tests/write_append_truncate.c
lustre/tests/opendevunlink.c
lustre/tests/o_directory.c

build

Summary

The build system is responsible for building Lustre and related components (ldiskfs is normally included in the Lustre tree but can also live completely separately).

The main build process is managed using GNU Autoconf and Automake. Here is a brief outline of how a Lustre binary build from a fresh Git checkout works. User commands appear at the start of each top-level item.

  • sh autogen.sh - autogen performs a few checks and bootstraps the build system using automake and autoconf. It should only need to be called once for a fresh Git clone, but sometimes it needs to be run again. See bug 12580.
    • Each component (Lustre and ldiskfs) has an autoMakefile.am in its toplevel directory that sets some variables and includes build/autoMakefile.am.toplevel. It also contains any toplevel autoMake code unique to that component.
    • configure.ac is used by autoconf to generate a configure script. The Lustre configure.ac mostly relies on the macros defined in */autoconf/*.m4 to do its work. The ldiskfs configure.ac is more self-contained and relies only on build/autoconf/*.m4.
  • ./configure --with-linux=/root/cfs/kernels/linux-2.6.9-55.EL.HEAD - Configure performs extensive checks of the underlying system and kernel, then produces autoMakefiles and Makefiles.
  • make - This is where things get really interesting.
    • The @INCLUDE_RULES@ directive in most Makefile.in files includes a whole set of build rules from build/Makefile. See the top of that file for a description of all cases.
    • Normally, it will include autoMakefile, so commands from that file will run.
    • build/autoMakefile.am.toplevel is the basis of the autoMakefile produced in the toplevel directory. It includes the "modules" target.
    • The modules target in turn calls the appropriate Linux make system if we are building on Linux.
    • This build system once again reads the Makefile in each directory, and case 2 from build/Makefile is followed.

So essentially, the Makefile.in controls the kernel build process, and the autoMakefile.am controls the userland build process as well as preparing the sources if necessary.

The build system can also be used to produce Lustre-patched kernels and binaries built against these kernels. The build/lbuild script does this - this is used by customers as well as the LTS. This script is in need of some serious cleanup, unfortunately.

Coding style note: as mentioned in Coding Guidelines, autoconf macros must follow the style specified in the GNU Autoconf manual. A lot of the older code has inconsistent style and is hard to follow - feel free to reformat when needed. New code must be styled correctly.

Code

Lustre build system:

  • build/* (shared with ldiskfs)
  • autogen.sh
  • autoMakefile.am
  • configure.ac
  • lustre.spec.in
  • Makefile.in
  • all autoMakefile.am files
  • all Makefile.in files

ldiskfs build system:

  • build/* (shared with Lustre)
  • autogen.sh
  • autoMakefile.am
  • configure.ac
  • lustre-ldiskfs.spec.in
  • Makefile.in
  • all autoMakefile.am files
  • all Makefile.in files