Lustre Tuning

From Lustre Wiki

Many options in Lustre are set by means of kernel module parameters. These parameters are contained in the modprobe.conf file (on SuSE, this may be modprobe.conf.local).

OSS Service Thread Count

The oss_num_threads parameter allows the number of OST service threads to be specified at module load time on the OSS nodes, and oss_max_threads sets the upper limit on that number:

options ost oss_num_threads={N}
options ost oss_max_threads={N}

An OSS defaults to a maximum of 512 service threads and a minimum of 2 service threads. The number of service threads is a function of how much RAM and how many CPUs are on each OSS node (1 thread / 256MB * NUMA regions). If the load on the OSS node is high, new service threads will be started in order to process more requests concurrently, up to 4x the initial number of threads (subject to the maximum of 512). For disk-based OSTs, a good starting point is 32 threads per OST.

Increasing the size of the thread pool may help when:

  • Several OSTs are exported from a single OSS
  • Back-end storage is running synchronously
  • I/O completions take excessive time due to slow storage

Decreasing the size of the thread pool may help if:

  • The clients are overwhelming the storage capacity
  • There are lots of "slow I/O" or similar messages

Increasing the number of I/O threads allows the kernel and storage to aggregate many writes together for more efficient disk I/O. The OSS thread pool is shared, and each thread allocates approximately 1.5 MB (maximum RPC size + 0.5 MB) for internal I/O buffers.
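As a rough illustration of the buffer cost described above, the following sketch multiplies the thread count by the per-thread allocation. The function name is hypothetical, and the 1 MB maximum RPC size is an assumption inferred from the "approximately 1.5 MB" figure; it is not read from a running system.

```python
def oss_buffer_memory_mb(num_threads, max_rpc_mb=1.0):
    """Approximate I/O buffer memory (in MB) used by the OSS thread pool.

    Each thread is assumed to allocate (maximum RPC size + 0.5 MB).
    max_rpc_mb=1.0 is an assumption that matches the ~1.5 MB per-thread figure.
    """
    per_thread_mb = max_rpc_mb + 0.5
    return num_threads * per_thread_mb


# Buffer memory at the default maximum of 512 service threads.
print(oss_buffer_memory_mb(512))  # 768.0
```

This counts only the internal I/O buffers; other per-thread and per-request memory is not included.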

It is very important to consider memory consumption when increasing the thread pool size. Drives are only able to sustain a certain amount of parallel I/O activity before performance is degraded due to the high number of seeks and the OST threads just waiting for I/O. In this situation, it may be advisable to decrease the load by decreasing the number of OST threads.

Determining the optimum number of OST threads is a process of trial and error. You may want to start with a number of OST threads equal to the number of actual disk spindles on the node. If you use RAID, subtract any spindles not used for actual data (e.g., 1 of N spindles for RAID5, 2 of N spindles for RAID6), and monitor the performance of clients during usual workloads. If performance is degraded, increase the thread count and retest, repeating until performance degrades again or reaches a satisfactory level.
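The starting point described above can be sketched as a small calculation. The helper names and the example RAID layout are hypothetical; only the parity counts (1 spindle for RAID5, 2 for RAID6) and the oss_num_threads option come from the text.

```python
# Spindles per RAID group that do not hold data, per the rule above.
PARITY_SPINDLES = {"none": 0, "raid5": 1, "raid6": 2}


def starting_ost_threads(raid_groups):
    """Return the number of data spindles across all RAID groups.

    raid_groups is a list of (raid_level, spindles_in_group) tuples.
    """
    total = 0
    for level, spindles in raid_groups:
        total += spindles - PARITY_SPINDLES[level]
    return total


def modprobe_line(num_threads):
    """Format the module option line for the chosen thread count."""
    return f"options ost oss_num_threads={num_threads}"


# Hypothetical OSS with two 10-disk RAID6 groups: 2 * (10 - 2) data spindles.
threads = starting_ost_threads([("raid6", 10), ("raid6", 10)])
print(modprobe_line(threads))  # options ost oss_num_threads=16
```

The result is only an initial value for the trial-and-error process; it should be adjusted after monitoring client performance.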

MDS Service Thread Count

There is a similar parameter for the number of MDS service threads:

options mds mds_num_threads={N}

The default number varies based on the server size up to a maximum of 64. The maximum number of threads (MDS_MAX_THREADS) is 1024.

Note: The OSS and MDS will automatically start new service threads dynamically in response to server loading. Setting the *_num_threads module parameter disables the automatic thread creation behavior.

LNET Tunables

Transmit and receive buffer size: With Lustre release 1.4.7 and later, ksocklnd now has separate parameters for the transmit and receive buffers.

 options ksocklnd tx_buffer_size=0 rx_buffer_size=0

If these parameters are left at the default (0), the system automatically tunes the transmit and receive buffer size. In almost every case, the defaults produce the best performance. Do not attempt to tune this unless you are a network expert.

irq_affinity: By default, this parameter is on. In the normal case on an SMP system, we would like our network traffic to remain local to a single CPU. This helps to keep the processor cache warm and minimizes the impact of context switches. This is especially helpful when an SMP system has more than one network interface and ideal when the number of interfaces equals the number of CPUs.

If you have an SMP platform with a single fast interface such as 10Gb Ethernet and more than two CPUs, you may see performance improve by turning this parameter off. As always, test to compare the impact.

Options for Formatting MDS and OST

The backing file systems on the MDS and OSTs are independent of each other, so the formatting parameters for them need not be the same. The size of the MDS backing file system depends solely on how many inodes you want in the total Lustre file system. It is not related to the size of the aggregate OST space.

Planning for Inodes

Every time you create a file on a Lustre file system, it consumes one inode on the MDS and one inode for each OST object that the file is striped over (normally it is based on the default stripe count option -c, but this may change on a per-file basis). In ldiskfs file systems, inodes are pre-allocated at format time, so creating a new file does not consume any of the free blocks. However, this also means that the format-time options should be conservative as it is not possible to increase the number of inodes after the file system is formatted. But it is possible to add OSTs with additional space and inodes to the file system.

To be on the safe side, plan for 2.5KB per inode on the MDS. This is the default. For the OST, the amount of space taken by each object depends entirely upon the usage pattern of the users/applications running on the system. Lustre, by necessity, defaults to a very conservative estimate for the object size (64KB per object for OSTs below 10GiB in size, up to 1MB per object for OSTs over 16TiB in size). You can almost always increase this for file system installations. Many Lustre file systems have average file sizes over 1MB per object, and this can easily be calculated from existing file systems by looking at the output of df and df -i and dividing the total KB used by the total number of inodes used.
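The df and df -i calculation above can be sketched in a few lines. The function names are hypothetical. os.statvfs is used here as an assumed stand-in for reading the two df outputs by hand, since it exposes the same block and inode counters.

```python
import os


def average_kb_per_inode(used_kb, used_inodes):
    """Divide the total KB used by the total number of inodes used."""
    if used_inodes <= 0:
        raise ValueError("no inodes in use; cannot compute an average")
    return used_kb / used_inodes


def usage_from_statvfs(path):
    """Return (used_kb, used_inodes) for the file system mounted at path."""
    st = os.statvfs(path)
    used_kb = (st.f_blocks - st.f_bfree) * st.f_frsize / 1024
    used_inodes = st.f_files - st.f_ffree
    return used_kb, used_inodes


# Example with made-up numbers: 2,048,000 KB used by 1,000 inodes.
print(average_kb_per_inode(2_048_000, 1_000))  # 2048.0
```

On a live system, average_kb_per_inode(*usage_from_statvfs(mount_point)) gives the same figure for whatever mount point is passed in.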

Sizing the MDT

When calculating the MDT size, the only important factor is the average size of files to be stored in the file system. If the average file size is, for example, 5MB and you have 100TB of usable OST space, then you need at least 100TB * 1024GB/TB * 1024MB/GB / 5MB/inode = about 20 million inodes. We recommend that you have twice the minimum, that is, 40 million inodes in this example. At the default 2.5KiB of MDT space per inode, this works out to only about 96GB of space for the MDT.
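The sizing arithmetic above can be reproduced with a short sketch. The function name and the binary-unit interpretation (TB and MB taken as TiB and MiB) are assumptions. Because the text rounds to 40 million inodes, the exact result here is slightly higher than the rounded figure quoted.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3
TiB = 1024 ** 4


def mdt_sizing(ost_space_bytes, avg_file_bytes, bytes_per_inode=2560, safety=2):
    """Return (recommended_inodes, mdt_bytes) for the given OST capacity.

    bytes_per_inode defaults to 2.5 KiB; safety=2 doubles the minimum count.
    """
    min_inodes = ost_space_bytes // avg_file_bytes
    inodes = safety * min_inodes
    return inodes, inodes * bytes_per_inode


# 100 TiB of OST space with a 5 MiB average file size.
inodes, mdt_bytes = mdt_sizing(100 * TiB, 5 * MiB)
print(inodes, mdt_bytes / GiB)  # 41943040 100.0
```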

Conversely, if you have a very small average file size, for example 4KB, Lustre is most efficient when the small files are stored directly on the MDT using the Data-on-MDT (DoM) feature. In this case, the MDT(s) should be large enough to hold the data of the small files entirely, in addition to the normal inode data. This would mean about 6.5KiB of MDT space per inode (2.5KiB of inode data plus 4KiB of file data).

Overriding Default Formatting Options

To override the default formatting options for any of the Lustre backing filesystems, use the --mkfsoptions='backing fs options' argument to mkfs.lustre to pass formatting options to the backing mkfs. For all options to format backing ldiskfs filesystems, see the mke2fs(8) man page; this section only discusses some Lustre-specific options. For ZFS filesystems, see the zpool(8) man page.

Number of Inodes for ldiskfs MDS

To override the bytes-per-inode ratio, which affects the number of inodes created for a given MDT size, use the option -i <bytes per inode>. For instance, --mkfsoptions='-i 4096' creates one inode per 4096 bytes of MDT filesystem space. Alternatively, to specify an absolute number of inodes, use the -N<number of inodes> option. Do not specify the -i option with an inode ratio below one inode per 1536 bytes, because the MDT could run out of space before all of the inodes are allocated. Use the -N option instead.

A 2TB MDS by default will have 400M inodes. A 10TB MDT would hold 2B inodes. With an MDS inode ratio of 1536 bytes per inode, a 3TB MDS would hold 2B inodes, and a 6TB MDS would hold 4B inodes, which is the maximum number of inodes currently supported by ldiskfs.
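The 1536-bytes-per-inode figures above follow from dividing the MDT size by the ratio, as the sketch below shows. The helper names are hypothetical and TB is assumed to mean TiB. The option string only reproduces the -i form shown earlier and enforces the 1536-byte floor from the text.

```python
TiB = 1024 ** 4
MIN_RATIO = 1536  # lowest bytes-per-inode ratio the text allows with -i


def inode_count(fs_bytes, bytes_per_inode):
    """Number of inodes created for a file system of fs_bytes at this ratio."""
    return fs_bytes // bytes_per_inode


def mkfs_inode_option(bytes_per_inode):
    """Build the --mkfsoptions argument for a bytes-per-inode ratio."""
    if bytes_per_inode < MIN_RATIO:
        raise ValueError("ratio below 1536 bytes per inode; use -N instead")
    return f"--mkfsoptions='-i {bytes_per_inode}'"


print(inode_count(3 * TiB, 1536))  # 2147483648 (about 2B)
print(inode_count(6 * TiB, 1536))  # 4294967296 (about 4B)
print(mkfs_inode_option(4096))     # --mkfsoptions='-i 4096'
```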

Inode Size for MDS

Lustre uses "large" inodes on the backing file systems in order to efficiently store Lustre metadata with each file. On the MDS, each inode is at least 1024 bytes in size by default, while on the OST each inode is 512 bytes in size. Lustre (or, more specifically, the backing ldiskfs file system) also needs sufficient space left for other metadata like the journal (up to 4GB), bitmaps and directories. There are also a few regular files that Lustre uses to maintain cluster consistency.

To specify a larger inode size, use the -I <inodesize> option. We do NOT recommend specifying a smaller-than-default inode size, as this can lead to serious performance problems; and you cannot change this parameter after formatting the file system. The inode ratio must always be larger than the inode size.

Number of Inodes for OST

For OST file systems, it is normally advantageous to take local file system usage into account. Try to minimize the number of inodes created on each OST, while keeping enough margin for potential variance in future usage. This helps in reducing the format and e2fsck time, and makes more space available for data. The current default is to create one inode per 64KB of space in the OST file system, but in many environments, this is far too many inodes for the average file size. As a good rule of thumb, the OSTs should have at least twice as many inodes as expected to avoid running out of inodes before space:

num_ost_inodes = 2 * <num_mds_inodes> * <default_stripe_count> / <number_osts>

You can specify the number of inodes on the OST file systems via the -N<num_inodes> option to --mkfsoptions. Alternatively, if you know the average file size, you can specify the OST inode ratio for the OST file systems via -i <average_file_size / (number_of_stripes * 2)>. For example, if the average file size is 80MB and there are by default 4 stripes per file, then --mkfsoptions='-i 10485760' would be appropriate (10MB per OST inode).
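Both OST rules of thumb above reduce to simple arithmetic, sketched below. The function names and the 40-million-inode, 16-OST example are hypothetical. The 80MB and 4-stripe case is the one from the text, with MB taken as MiB.

```python
MiB = 1024 ** 2


def num_ost_inodes(num_mds_inodes, default_stripe_count, number_osts):
    """Inodes per OST: 2 * MDS inodes * default stripe count / number of OSTs."""
    return 2 * num_mds_inodes * default_stripe_count // number_osts


def ost_inode_ratio(avg_file_bytes, number_of_stripes):
    """Bytes per OST inode: average file size / (number of stripes * 2)."""
    return avg_file_bytes // (number_of_stripes * 2)


# Hypothetical file system: 40 million MDS inodes, 4 stripes, 16 OSTs.
print(num_ost_inodes(40_000_000, 4, 16))  # 20000000

# Example from the text: 80 MiB average file size, 4 stripes per file.
ratio = ost_inode_ratio(80 * MiB, 4)
print(f"--mkfsoptions='-i {ratio}'")  # --mkfsoptions='-i 10485760'
```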