CHAPTER 20
Lustre Tuning
Many options in Lustre are set by means of kernel module parameters. These parameters are set in the modprobe.conf file (on SuSE, this may be modprobe.conf.local).
The oss_num_threads parameter enables the number of OST service threads to be specified at module load time on the OSS nodes:
options ost oss_num_threads={N}
After startup, the minimum and maximum number of OSS threads can be set via the {service}.thread_{min,max,started} tunable. To change the tunable at runtime, run:
lctl {get,set}_param {service}.thread_{min,max,started}
For details, see Setting MDS and OSS Thread Counts.
An OSS can have a minimum of 2 service threads and a maximum of 512 service threads. The number of service threads is a function of how much RAM and how many CPUs are on each OSS node (1 thread / 128MB * num_cpus). If the load on the OSS node is high, new service threads will be started in order to process more requests concurrently, up to 4x the initial number of threads (subject to the maximum of 512). For a 2GB 2-CPU system, the default thread count is 32 and the maximum thread count is 128.
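The thread-count rule above can be sketched as a small calculation. This is an illustration only, not the code Lustre itself runs; the function name oss_thread_counts and the clamping of the initial count to the 2..512 range are assumptions made for this example.

```python
MIN_THREADS = 2    # minimum number of OSS service threads, per the text
MAX_THREADS = 512  # maximum number of OSS service threads, per the text


def oss_thread_counts(ram_mb, num_cpus):
    """Return (initial, maximum) OSS thread counts for one OSS node."""
    # 1 thread per 128 MB of RAM, multiplied by the number of CPUs.
    initial = (ram_mb // 128) * num_cpus
    # Assumption: the initial count is clamped to the documented range.
    initial = max(MIN_THREADS, min(initial, MAX_THREADS))
    # Under load, the pool may grow to 4x the initial count, capped at 512.
    maximum = min(4 * initial, MAX_THREADS)
    return initial, maximum


print(oss_thread_counts(2048, 2))  # (32, 128), the 2GB 2-CPU example
```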
Increasing the size of the thread pool may help when:
Decreasing the size of the thread pool may help if:
Increasing the number of I/O threads allows the kernel and storage to aggregate many writes together for more efficient disk I/O. The OSS thread pool is shared--each thread allocates approximately 1.5 MB (maximum RPC size + 0.5 MB) for internal I/O buffers.
It is very important to consider memory consumption when increasing the thread pool size. Drives are only able to sustain a certain amount of parallel I/O activity before performance is degraded, due to the high number of seeks and the OST threads just waiting for I/O. In this situation, it may be advisable to decrease the load by decreasing the number of OST threads.
Determining the optimum number of OST threads is a process of trial and error, and varies for each particular configuration. Variables include the number of OSTs on each OSS, number and speed of disks, RAID configuration, and available RAM. You may want to start with a number of OST threads equal to the number of actual disk spindles on the node. If you use RAID, subtract any dead spindles not used for actual data (e.g., 1 of N spindles for RAID5, 2 of N spindles for RAID6), and monitor the performance of clients during usual workloads. If performance is degraded, increase the thread count and retest, repeating until performance degrades again or becomes satisfactory.
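As a rough sketch of the starting point described above, the following computes an initial OST thread count from the number of data spindles and estimates the I/O buffer memory the pool would use at approximately 1.5 MB per thread. The function names and the raid_groups parameter (parity is subtracted once per RAID group) are assumptions introduced for this example; the result is only a first guess to refine by testing.

```python
# Spindles per RAID group that do not hold data, per the text above.
PARITY_SPINDLES = {"none": 0, "raid5": 1, "raid6": 2}
BUFFER_MB_PER_THREAD = 1.5  # approximate I/O buffer size per OSS thread


def starting_ost_threads(spindles, raid="none", raid_groups=1):
    """Suggest an initial OST thread count: one thread per data spindle."""
    # Assumption: parity spindles are subtracted once for each RAID group.
    data_spindles = spindles - PARITY_SPINDLES[raid] * raid_groups
    if data_spindles < 1:
        raise ValueError("no data spindles left after subtracting parity")
    return data_spindles


def buffer_memory_mb(threads):
    """Estimate the memory used by the thread pool for internal I/O buffers."""
    return threads * BUFFER_MB_PER_THREAD


threads = starting_ost_threads(spindles=10, raid="raid6")
print(threads, buffer_memory_mb(threads))  # 8 12.0
```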
The mds_num_threads parameter enables the number of MDS service threads to be specified at module load time on the MDS node:
options mds mds_num_threads={N}
After startup, the minimum and maximum number of MDS threads can be set via the {service}.thread_{min,max,started} tunable. To change the tunable at runtime, run:
lctl {get,set}_param {service}.thread_{min,max,started}
For details, see Setting MDS and OSS Thread Counts.
At this time, no testing has been done to determine the optimal number of MDS threads. The default value varies, based on server size, up to a maximum of 32. The maximum number of threads (MDS_MAX_THREADS) is 512.
This section describes LNET tunables.
With Lustre release 1.4.7 and later, ksocklnd has separate parameters for the transmit and receive buffers.
options ksocklnd tx_buffer_size=0 rx_buffer_size=0
If these parameters are left at the default value (0), the system automatically tunes the transmit and receive buffer size. In almost every case, this default produces the best performance. Do not attempt to tune these parameters unless you are a network expert.
By default, this parameter is on. In the normal case of an SMP system, we would like network traffic to remain local to a single CPU. This helps to keep the processor cache warm and minimizes the impact of context switches. This is especially helpful when an SMP system has more than one network interface and ideal when the number of interfaces equals the number of CPUs.
If you have an SMP platform with a single fast interface such as 10 Gigabit Ethernet and more than two CPUs, you may see performance improve by turning this parameter off. As always, you should test to compare the impact.
The backing file systems on an MDT and OSTs are independent of one another, so the formatting parameters for them should not be the same. The size of the MDS backing file system depends solely on how many inodes you want in the total Lustre file system. It is not related to the size of the aggregate OST space.
Each time you create a file on a Lustre file system, it consumes one inode on the MDS and one inode for each OST object that the file is striped over (normally it is based on the default stripe count option -c, but this may change on a per-file basis). In ext3/ldiskfs file systems, inodes are pre-allocated, so creating a new file does not consume any of the free blocks. However, this also means that the format-time options should be conservative, as it is not possible to increase the number of inodes after the file system is formatted. It is possible to add OSTs with additional space and inodes to the file system.
To be on the safe side, plan for 4 KB per inode on the MDT. This is the default value. For the OST, the amount of space taken by each object depends entirely upon the usage pattern of the users/applications running on the system. Lustre, by necessity, defaults to a very conservative estimate for the object size (16 KB per object). You can almost always increase this value for file system installations. Many Lustre file systems have average file sizes over 1 MB per object.
When calculating the MDS size, the only important factor is the average size of files to be stored in the file system. If the average file size is, for example, 5 MB and you have 100 TB of usable OST space, then you need at least (100 TB * 1024 GB/TB * 1024 MB/GB / 5 MB/inode) = 20 million inodes. We recommend that you have twice the minimum, that is, 40 million inodes in this example. At the default 4 KB per inode, this works out to only 160 GB of space for the MDS.
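The sizing arithmetic above can be reproduced with a short calculation. This is a sketch under the stated figures; the function name mds_sizing and the safety_factor parameter (the recommended doubling) are names chosen for this example. Units are binary, so the 160 GB in the text corresponds to 160 GiB here.

```python
import math


def mds_sizing(ost_space_tb, avg_file_mb, bytes_per_inode=4096, safety_factor=2):
    """Return (minimum inodes, recommended inodes, MDS space in GiB)."""
    # Convert usable OST space to MB (1 TB = 1024 GB, 1 GB = 1024 MB).
    ost_space_mb = ost_space_tb * 1024 * 1024
    # One MDS inode is needed per file of average size.
    min_inodes = math.ceil(ost_space_mb / avg_file_mb)
    # The text recommends twice the minimum.
    recommended = safety_factor * min_inodes
    mds_gib = recommended * bytes_per_inode / 2**30
    return min_inodes, recommended, mds_gib


print(mds_sizing(100, 5))  # (20971520, 41943040, 160.0)
```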
Conversely, if you have a very small average file size, 4 KB for example, Lustre is not very efficient. This is because you consume as much space on the MDS as on the OSTs. This is not a very common configuration for Lustre.
To override the default formatting options for any of the Lustre backing file systems, use the --mkfsoptions='backing fs options' argument to mkfs.lustre to pass formatting options to the backing mkfs. For all options to format backing ext3 and ldiskfs filesystems, see the mke2fs(8) man page; this section only discusses several Lustre-specific options.
The number of inodes on the MDS is determined at format time based on the total size of the file system to be created. The default MDS inode ratio is one inode for every 4096 bytes of file system space. To override the inode ratio, use the option -i <bytes per inode>. For example, use --mkfsoptions="-i 4096" to create one inode per 4096 bytes of file system space. Alternatively, to specify an absolute number of inodes, use the -N <number of inodes> option. To avoid unintentional mistakes, do not specify the -i option with an inode ratio below one inode per 1024 bytes. Instead, use the -N option.
For example, by default, a 2 TB MDS will have 512M inodes. The largest currently-supported file system size is 16 TB, which would hold 4B inodes, the maximum possible number of inodes with ldiskfs. With an MDS inode ratio of 1024 bytes per inode, a 2 TB MDS would hold 2B inodes, and a 4 TB MDS would hold 4B inodes, which is the maximum number of inodes currently supported by ext3.
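The inode counts quoted above follow from dividing the file system size by the inode ratio. The sketch below reproduces them; the function name inode_count is hypothetical, and treating the "4B" limit as exactly 2^32 inodes is an assumption made for this example.

```python
TB = 1024**4          # bytes in one (binary) terabyte
MAX_INODES = 2**32    # assumption: the "4B" inode limit is exactly 2^32


def inode_count(fs_bytes, bytes_per_inode=4096):
    """Number of inodes created for a file system of the given size."""
    return min(fs_bytes // bytes_per_inode, MAX_INODES)


print(inode_count(2 * TB))         # 536870912  (512M, default ratio)
print(inode_count(16 * TB))        # 4294967296 (4B, default ratio)
print(inode_count(2 * TB, 1024))   # 2147483648 (2B at 1024 bytes per inode)
print(inode_count(4 * TB, 1024))   # 4294967296 (4B at 1024 bytes per inode)
```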
Lustre uses "large" inodes on backing file systems to efficiently store Lustre metadata with each file. On the MDS, each inode is at least 512 bytes in size (by default), while on the OST each inode is 256 bytes in size. Lustre (or more specifically the backing ext3 file system), also needs sufficient space left for other metadata like the journal (up to 400 MB), bitmaps and directories. There are also a few regular files that Lustre uses to maintain cluster consistency.
To specify a larger inode size, use the -I <inodesize> option. We do NOT recommend specifying a smaller-than-default inode size, as this can lead to serious performance problems; and you cannot change this parameter after formatting the file system. The inode ratio must always be larger than the inode size.
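The constraints on -i and -I described in this section can be checked before formatting. The helper below is a hypothetical convenience written for this example, not part of mkfs.lustre; it only assembles the option string and applies the two rules stated in the text (the inode ratio must exceed the inode size, and ratios below 1024 bytes should be expressed with -N instead).

```python
def mkfs_inode_options(bytes_per_inode, inode_size=512):
    """Build a --mkfsoptions string after checking the documented rules."""
    # The inode ratio must always be larger than the inode size.
    if bytes_per_inode <= inode_size:
        raise ValueError("inode ratio must be larger than the inode size")
    # Ratios below one inode per 1024 bytes should use -N instead of -i.
    if bytes_per_inode < 1024:
        raise ValueError("ratio below 1024 bytes: use -N <number of inodes>")
    return f"--mkfsoptions='-i {bytes_per_inode} -I {inode_size}'"


print(mkfs_inode_options(4096))  # --mkfsoptions='-i 4096 -I 512'
```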
For OST file systems, it is normally advantageous to take local file system usage into account. Try to minimize the number of inodes on each OST, while keeping enough margin for potential variance in future usage. This helps reduce the format and e2fsck time, and makes more space available for data. The current default is to create one inode per 16 KB of space in the OST file system, but in many environments, this is far too many inodes for the average file size. As a good rule of thumb, the OSTs should have at least:
num_ost_inodes = 4 * <num_mds_inodes> * <default_stripe_count> / <number_osts>
You can specify the number of inodes on the OST file systems via the -N <num_inodes> option to --mkfsoptions. Alternatively, if you know the average file size, you can specify the OST inode count for the OST file systems via -i <average_file_size / (number_of_stripes * 4)>. For example, if the average file size is 16 MB and there are, by default, 4 stripes per file, then --mkfsoptions='-i 1048576' would be appropriate.
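Both OST sizing rules above reduce to simple arithmetic, sketched here. The function names are chosen for this example, and rounding the inode count up to a whole number is an assumption; the second function reproduces the -i 1048576 value from the text.

```python
def ost_inodes_rule_of_thumb(num_mds_inodes, default_stripe_count, number_osts):
    """Minimum inodes per OST: 4 * MDS inodes * stripe count / number of OSTs."""
    total = 4 * num_mds_inodes * default_stripe_count
    # Ceiling division, so the result is never below the rule of thumb.
    return -(-total // number_osts)


def ost_bytes_per_inode(average_file_bytes, number_of_stripes):
    """Value for -i: average file size / (number of stripes * 4)."""
    return average_file_bytes // (number_of_stripes * 4)


MB = 1024 * 1024
print(ost_bytes_per_inode(16 * MB, 4))             # 1048576
print(ost_inodes_rule_of_thumb(41943040, 1, 10))   # 16777216
```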
For more details on formatting MDT and OST file systems, see Formatting Options for RAID Devices.
This section only applies to Cray XT3 Catamount nodes, and explains parameters used with the kptllnd module. If it does not apply to your setup, ignore it.
With a large number of clients and servers possible on these systems, tuning various request pools becomes important. We are making changes to the ptllnd module.
The lockless I/O tunable feature allows servers to ask clients to do lockless I/O (liblustre-style where the server does the locking) on contended files.
The lockless I/O patch introduces these tunables:
/proc/fs/lustre/ldlm/namespaces/filter-lustre-*
contended_locks - If the number of lock conflicts found while scanning the granted and waiting queues exceeds contended_locks, the resource is considered to be contended.
contention_seconds - The resource remains in the contended state for the number of seconds set in this parameter.
max_nolock_bytes - Server-side locking is used only for requests smaller than the number of bytes set in max_nolock_bytes. Setting this tunable to zero (0) disables server-side locking for read/write requests.
/proc/fs/lustre/llite/lustre-*
contention_seconds - llite inode remembers its contended state for the time specified in this parameter.
The /proc/fs/lustre/llite/lustre-*/stats file has new rows for lockless I/O statistics.
lockless_read_bytes and lockless_write_bytes - These rows count the total bytes read or written without the client acquiring locks. The client makes its own decision based on the request size and does not communicate with the server if the request size is smaller than min_nolock_size.
To avoid the risk of data corruption on the network, a Lustre client can perform end-to-end data checksums[1]. Be aware that at high data rates, checksumming can impact Lustre performance.
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.