Introduction to Lustre Object Storage Devices (OSDs)

From Lustre Wiki

Overview

All persistent information for a Lustre file system is contained on block storage file systems distributed across a set of storage servers. Lustre’s architecture is built on a distributed object storage model where back-end block storage is abstracted by an API called the Object Storage Device, or OSD. The OSD enables Lustre to use different back-end file systems for persistence, of which LDISKFS (based on EXT4) and ZFS are the two currently implemented. The term OSD is also used as a generic term for an instance of a Lustre storage target, such as an MGT, MDT or OST. In this case, what is meant is any storage device or LUN that has been formatted with Lustre on-disk data structures. For example, if an action can be applied to any storage target, the term OSD will normally be used, rather than writing out all three types of storage.

Lustre objects can either be data objects holding a byte stream for file data, or index objects, which are typically used for metadata such as directory information.

A single OSD instance corresponds to precisely one back-end storage volume. Physical devices are assembled into logical units or volumes and these are used to create OSD instances. Lustre has three types of OSD instance, corresponding to the types of Lustre services. These are:

  • Management Target (MGT): used by the Management Service (MGS) to maintain file system configuration data used by all hosts in the Lustre environment.
  • Metadata Target (MDT): used by the Metadata Service (MDS) to record the file system name space (file and directory information for an instance of Lustre)
  • Object Storage Target (OST): used by the Object Storage Service (OSS) to record data objects representing the contents of files.

At-a-Glance Comparison of Lustre with LDISKFS vs ZFS

Feature                                  | LDISKFS                       | ZFS
Maximum volume size                      | 50TB [1EB]                    | 256TB [256ZB]
Maximum native file size                 | 16TB                          | 256TB
Maximum Lustre file size                 | 32PB                          | 512PB
Maximum Lustre file system size          | 512PB                         | 8EB
Maximum inodes per MDT                   | 4B                            | unlimited
Native data protection (RAID)            | None                          | Mirror, RAIDZ{1, 2, 3}, DRAID (future)
Data integrity                           | Journal                       | Block-level checksums
Protection against write hole            | None / hardware dependent     | Copy-on-write transactions; dynamic stripe width
Detect, repair silent data corruption    | None / hardware dependent     | Yes
Compression (maximize GB/$)              | No                            | Yes
Volume management                        | None                          | Integrated, powerful and simple management tools
Snapshots                                | None                          | Integrated; Lustre support available
Hybrid storage tiering                   | None                          | Integrated support for read I/O cache accelerators
File system repair                       | Offline fsck; requires outage | ZFS scrub: online repair, no downtime

The maximum realistic usable value for EXT4 volume size is based on information from Red Hat:

https://access.redhat.com/articles/rhel-limits

Although the 50TB limit is conservative, it does serve to demonstrate the difficulty of expanding existing Linux file systems to meet future demand.

Theoretical values for ZFS are far in excess of what can currently be tested. ZFS helps to ensure that the fundamentals of Lustre scalability will not be compromised by limitations in the underlying storage. ZFS enables Lustre storage systems to scale with user demand.

The ZFS values for maximum volume size and native file size represent limits in current testing, not limits in ZFS itself. The maximum theoretical ZFS volume size is 256ZB (zettabytes), and the maximum theoretical size of an individual file is 16EB.

The Lustre maximum theoretical file size, based on 16EB objects striped across 2000 OSTs, is 32,000EB.

The Lustre maximum theoretical file system size when using ZFS is 8150 × the maximum volume size, which is 2,086,400ZB. A more realistic maximum is a small number of exabytes, perhaps a limit of 32,000EB.
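The arithmetic behind these figures can be verified with simple shell arithmetic (a sanity check only; the units are carried in the comments):

```shell
# Maximum theoretical Lustre file size:
# 16 EB per object x 2000 OST stripes = 32,000 EB
echo $(( 16 * 2000 ))

# Maximum theoretical Lustre file system size with ZFS:
# 8150 volumes x 256 ZB per volume = 2,086,400 ZB
echo $(( 8150 * 256 ))
```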

The point is: ZFS is designed to scale. Lustre builds on that scalability even further to continue to deliver massive performance and usable capacity across a single, coherent, POSIX namespace.

Formatting Lustre Storage

Lustre storage has to be formatted for each supported service: MGT, MDT or OST. The mkfs.lustre command is supplied with the Lustre server software for this purpose. The general command syntax follows:

mkfs.lustre { --mgs | --mdt | --ost } \
  [ --reformat ] \
  [ --fsname <name> ] \
  [ --index <n> ] \
  [ --mgsnode <MGS NID> [--mgsnode <MGS NID> …]] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  [ --backfstype=zfs ] \
  [ --mkfsoptions <options> ] \
  { <pool name>/<dataset> [<zpool specification>] | <device> }

Formatting storage targets is covered in detail in separate articles for MGT, MDT and OST storage devices; for now, just be aware that all Lustre OSDs are created using the same mkfs.lustre command-line application.

The purpose of the storage target is determined at format time through selection of one of --mgs, --mdt or --ost. The software also defines the extent to which a storage target is to be made highly-available, through the --failnode or --servicenode flags. For MDTs and OSTs, the administrator must supply a name (--fsname) for the Lustre file system to which they are allocated, and an index number (--index), which is a unique non-negative integer within the file system storage population.
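As a sketch of how these options combine (the file system name, NIDs and block device below are hypothetical, not a prescribed configuration), an LDISKFS OST for a file system named demo might be formatted like this:

```shell
# Format an LDISKFS OST for the hypothetical file system "demo",
# with index 1, registering against a hypothetical MGS NID,
# and listing the hosts that can serve this target
mkfs.lustre --ost \
  --fsname demo \
  --index 1 \
  --mgsnode 192.168.227.11@tcp1 \
  --servicenode 192.168.227.21@tcp1 \
  --servicenode 192.168.227.22@tcp1 \
  /dev/sdb
```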

Additional formatting options can be supplied to the underlying backing store file system with the --mkfsoptions parameter; EXT4 options and ZFS file system dataset properties can be supplied this way, but note that ZFS zpool options cannot be included directly using mkfs.lustre.

When working with ZFS-based storage, one can use the mkfs.lustre command to assemble the ZFS pools, and also create the file system datasets that will contain the Lustre on-disk data. In this case, the ZFS pool specification is supplied along with the pool name and dataset name. However, this approach is not always suitable when working with production configurations because it does not allow an administrator to set or override properties of the ZFS pool.
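For production configurations, a common pattern (all pool names, devices and NIDs here are hypothetical) is therefore to create the zpool first, so that pool properties can be set explicitly, and then run mkfs.lustre against a dataset in the existing pool:

```shell
# Create the pool directly, setting pool properties that
# mkfs.lustre cannot pass through (e.g. ashift)
zpool create -o ashift=12 -O canmount=off demo-ost0pool mirror sdb sdc

# Format a Lustre OST dataset inside the existing pool
mkfs.lustre --ost \
  --fsname demo \
  --index 0 \
  --backfstype=zfs \
  --mgsnode 192.168.227.11@tcp1 \
  --servicenode 192.168.227.21@tcp1 \
  demo-ost0pool/ost0
```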

Defining Service Failover (--failnode vs --servicenode)

All Lustre file system storage services are associated with block storage targets that contain a set of data for a given file system. The content varies by service type, but the sum of all of the data on all of the storage targets represents each filesystem as a whole. Loss of access to a service, for example because the host running the service has crashed, effectively means loss of access to the associated storage targets and the data they contain.

To protect against server loss, Lustre makes use of a standard high availability paradigm called failover, whereby a service can be run on one of a set of hosts, and if it fails, one of the surviving hosts in the set can restart the services that were running on the host that failed. For this to work with Lustre, the storage targets must be accessible by each host in the failover group. This is accomplished by connecting the servers to one or more arrays of stand-alone, shared storage (i.e., storage in a self-contained enclosure that is separate from any individual server).

Lustre services are usually grouped into HA pairs: two servers connected to a common pool of shareable block storage. For an interesting alternative architecture, refer to the Scalable high availability for Lustre with Pacemaker presentation from LUG 2017.

Failover is an attribute of each Lustre storage target, and is written into each storage target’s configuration. Each target contains a list of the host NIDs that are able to mount the storage and present it to the network. This list of NIDs is registered with the MGS on mount, so that the clients know which NIDs to connect to. If one of the NIDs does not respond, the client service will try the next NID in the list for that service. When it runs out of unique NIDs, the client will retry the list in sequence from the top until a connection is made. By using failover with shared storage, Lustre is resilient to failures and clients are able to tolerate temporary outages in server infrastructure.

There are two ways to define the failover configuration for a Lustre storage target:

  1. failnode: this is the original method for defining failover groups, and it creates a configuration where there is a primary or preferred host for a given target, and one or more failover hosts.
  2. servicenode: this is a newer option for defining the set of hosts where a given target can be mounted. With servicenode, there is no defined primary node for a service. Notionally, all hosts are equally able to run a given service, with no defined preferred primary. This is the recommended method for defining failover information.

Each method is valid and supported by Lustre, but the methods are mutually incompatible. A storage target can contain either a failnode configuration or a servicenode configuration, but not both.
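To illustrate the difference between the two styles (the NIDs and device below are hypothetical), the same OST could be formatted either way, but only with one of the two:

```shell
# failnode style: the primary host is implied by wherever the
# target is first mounted; only the failover NID is listed
mkfs.lustre --ost --fsname demo --index 2 \
  --mgsnode 192.168.227.11@tcp1 \
  --failnode 192.168.227.22@tcp1 \
  /dev/sdc

# servicenode style: all capable hosts are listed explicitly,
# with no implied primary (the recommended method)
mkfs.lustre --ost --fsname demo --index 2 \
  --mgsnode 192.168.227.11@tcp1 \
  --servicenode 192.168.227.21@tcp1 \
  --servicenode 192.168.227.22@tcp1 \
  /dev/sdc
```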

Using the failnode configuration syntax, the administrator lists the set of failover nodes that a storage target can be accessed from, but does not explicitly define the primary node. The primary node is not determined until the first time that a formatted storage target is mounted, at which point the configuration is updated with the NID of the server where the mount command is executed.

This means that the primary node must be online and available for service when the storage target is mounted the first time. It also means that there is potential for mistakes to creep into process execution, because the failnode configuration is dependent on a very specific start-up sequence for the first time mount of a Lustre device.

Note also that the tunefs.lustre command will show only the hosts supplied as failnodes on the command line; its output is not updated with information about the primary node, even after the OSD is mounted. For example:

[root@rh7z-mds1 ~]# mkfs.lustre --reformat --mgs \
  --failnode 192.168.227.12@tcp1 \
  --backfstype=zfs \
  mgspool/mgt

   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:  
Mount type: zfs
Flags:      0x64
              (MGS first_time update )
Persistent mount opts: 
Parameters: failover.node=192.168.227.12@tcp1

mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
Writing mgspool/mgt properties
  lustre:version=1
  lustre:flags=100
  lustre:index=65535
  lustre:svname=MGS
  lustre:failover.node=192.168.227.12@tcp1

[root@rh7z-mds1 ~]# tunefs.lustre --dryrun mgspool/mgt
checking for existing Lustre data: found

   Read previous values:
Target:     MGS
Index:      unassigned
Lustre FS:  
Mount type: zfs
Flags:      0x44
              (MGS update )
Persistent mount opts: 
Parameters: failover.node=192.168.227.12@tcp1


   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:  
Mount type: zfs
Flags:      0x44
              (MGS update )
Persistent mount opts: 
Parameters: failover.node=192.168.227.12@tcp1

exiting before disk write.

The servicenode syntax defines a list of peers equally capable of mounting the storage target, with no implied primary host. Any one of the defined peers can mount the storage and start the services for that storage. This method is much easier to implement and maintain, because all server NIDs are written directly into the configuration, leaving no ambiguity about which hosts are associated with the storage target.

It is recommended that the servicenode method be used for creating HA failover configurations for Lustre storage targets, given the increased flexibility in adding new services, explicit definition of the hosts that are able to run the service, and the ability to better exploit the resource management features of HA software frameworks like Pacemaker.

Lustre Device and Mount Point Naming Conventions

There are three device types for Lustre storage: MGT, MDT and OST. These correspond to the MGS, MDS and OSS Lustre services, respectively. The following guidance has been developed as a recommended naming convention for Lustre’s persistent storage components:

Service | ZFS Pool Name*      | ZFS Dataset Name* | Mount Point
MGS     | mgspool             | mgt               | /lustre/mgt
MDS     | <fsname>-mdt<n>pool | mdt<n>            | /lustre/<fsname>/mdt<n>
OSS     | <fsname>-ost<n>pool | ost<n>            | /lustre/<fsname>/ost<n>
Client  | n/a                 | n/a               | /lustre/<fsname>

* ZFS Pool Name and Dataset name only apply to ZFS-based storage targets.

Note: the Lustre file system name is limited to eight characters. Yes, like DOS.

The naming of pools, datasets and mount points presented here is provided as a recommendation only. Administrators can make their own choices about naming. The convention chosen here has been designed to provide a standard for describing the components that is unambiguous and easy to interpret.
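Following this convention, mounting the targets for a hypothetical file system named demo might look like the following (the MGS NID is an assumption for illustration):

```shell
# MGS: a global resource, so no file system name in the path
mkdir -p /lustre/mgt
mount -t lustre mgspool/mgt /lustre/mgt

# First MDT of the file system "demo"
mkdir -p /lustre/demo/mdt0
mount -t lustre demo-mdt0pool/mdt0 /lustre/demo/mdt0

# Client mount of "demo", referencing the MGS NID
mkdir -p /lustre/demo
mount -t lustre 192.168.227.11@tcp1:/demo /lustre/demo
```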

The MGT is the persistent data store for the MGS, which is a global resource, and the only Lustre service that is independent of any specific Lustre file system. As such, the ZFS dataset name and file system mount point do not make reference to a Lustre file system instance.

All other Lustre storage devices should make reference to the file system name.