Optimizing o2iblnd Performance

In addition to defining the LNet interfaces, the kernel module option files in /etc/modprobe.d can be used to supply parameters to the other kernel modules used by Lustre. This is commonly used to supply tuning optimizations to the LNet drivers in order to maximize the performance of the network interface. An example of this optimization can be seen in Lustre version 2.8.0 and later in the file /etc/modprobe.d/ko2iblnd.conf, which includes the following:

alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

install ko2iblnd /usr/sbin/ko2iblnd-probe

This configuration is automatically applied to the LNet kernel module when an Intel® Omni-Path interface is installed, but not when a different network interface is present. Please note that when a host is connected to more than one fabric sharing the same Lustre network driver, options set by modprobe will be applied to all interfaces using the same driver. To set individual, per-device tuning parameters, use the Dynamic LNet configuration utility, lnetctl, to configure the interfaces instead.
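
For illustration, a per-interface configuration applied with lnetctl might look like the following sketch. The network name o2ib0, the interface name ib0, and the credit values are placeholders to be adjusted for the installed hardware, and the example assumes a version of lnetctl whose net add command accepts the --peer-credits and --credits options:

# Initialize LNet before adding networks dynamically
lnetctl lnet configure

# Add the o2ib0 network on interface ib0 with per-NI tunables (placeholder values)
lnetctl net add --net o2ib0 --if ib0 --peer-credits 128 --credits 1024

# Review the tunables now in effect for each configured network interface
lnetctl net show -v

# Optionally save the running configuration so it can be restored at boot
lnetctl export > /etc/lnet.conf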

The following set of options has been defined to optimize performance on the Intel® Omni-Path Architecture. A detailed description is beyond the scope of this article, but the following summary provides an overview:

  • peer_credits=128 - the number of concurrent sends to a single peer
  • peer_credits_hiw=64 - the high-water mark at which credits start to be eagerly returned to the peer
  • credits=1024 - the number of concurrent sends (to all peers)
  • concurrent_sends=256 - send work-queue sizing
  • ntx=2048 - the number of message descriptors that are pre-allocated when the ko2iblnd module is loaded in the kernel
  • map_on_demand=32 - the number of noncontiguous memory regions that will be mapped into a virtual contiguous region
  • fmr_pool_size=2048 - the size of the Fast Memory Registration (FMR) pool (must be >= ntx/4)
  • fmr_flush_trigger=512 - the dirty FMR pool flush trigger
  • fmr_cache=1 - enable FMR caching
  • conns_per_peer=4 - create multiple queue pairs per peer to allow higher throughput from a single client. This is of most benefit to OPA interfaces when coupled with the krcvqs option of the OPA hfi1 kernel driver; it is recommended to set krcvqs=4. In some cases, setting krcvqs=8 will yield improved IO performance, but this can impact other workloads, especially on clients. If queue-pair memory usage becomes excessive, reduce the ko2iblnd conns_per_peer value to 2 and set krcvqs=2.

The default values used by Lustre if no parameters are given are as follows; a sketch for reading back the values in effect on a running node appears after the list:

  • peer_credits=8
  • peer_credits_hiw=8
  • concurrent_sends=8
  • credits=64
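
As a quick sanity check, the parameter values applied to a loaded ko2iblnd module can be read back from sysfs (only parameters exported as readable appear there), and lnetctl can report the tunables LNet is using. The parameters shown below are just examples:

# Read back selected ko2iblnd module parameters from sysfs
cat /sys/module/ko2iblnd/parameters/peer_credits
cat /sys/module/ko2iblnd/parameters/credits
cat /sys/module/ko2iblnd/parameters/map_on_demand

# Show the tunables LNet reports for each configured network interface
lnetctl net show -v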

Optimizations are applied automatically on detection of an Intel® high-performance network interface. Some of the parameters, such as those related to FMR, are incompatible with other devices, such as Mellanox InfiniBand products using the MLX5 driver. FMR mapping can be disabled by setting map_on_demand=0 (the default). The configuration file can be modified or deleted to meet the specific requirements of a given installation.
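
If the file is kept on a system with Mellanox MLX5 HCAs rather than deleted, a minimal hypothetical override that leaves on-demand mapping (and therefore FMR) disabled might look like this; the value simply restates the default and is shown only for illustration:

# Hypothetical override for non-OPA HCAs: keep FMR/on-demand mapping disabled
options ko2iblnd map_on_demand=0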

In general, the default ko2iblnd settings work well with Mellanox InfiniBand HCAs and no tuning is normally required. Architecture differences between Intel® fabrics and Mellanox hardware mean that setting universal defaults is very difficult. Intel® OPA and Intel® True Scale Fabric have an architecture that favors lightweight, high-frequency message-passing communications, whereas Mellanox has historically placed an emphasis on throughput-oriented workloads. Because Mellanox InfiniBand has historically been the dominant high-speed fabric, LNet driver development has naturally tended to align with this technology, aided by interfaces that are intended to support storage-like workloads. The settings above tune the LNet driver for communications on Intel® fabrics when such an interface is present.

Note: It is possible to use the socklnd driver on RDMA fabrics if there is an upper-level protocol that supports TCP/IP traffic, such as the IPoIB driver for InfiniBand fabrics. This use of socklnd on InfiniBand, RoCE, and Intel® OPA networks is not recommended because it will compromise the performance of LNet compared to the RDMA-based o2iblnd, and can have a negative impact on the stability of the resulting network connection. Instead, it is strongly recommended that o2iblnd is used wherever possible; it provides the highest performance with the lowest overheads on these fabrics.

Additional Intel® Omni-Path Optimization

Intel makes the following recommendations for OPA hfi1 driver options for use with Lustre:

options hfi1 krcvqs=4 piothreshold=0 sge_copy_mode=2 wss_threshold=70

Some experimentation with the krcvqs parameter may be required to find the optimal balance of Lustre IO performance against other workloads. Lustre servers may derive additional performance from increasing the value up to 8. For Lustre clients, higher values can improve Lustre performance but might impact application performance.
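
These options are normally persisted in a modprobe configuration file so that they are applied whenever the driver loads. The file name below is chosen for illustration, and the read-back assumes the hfi1 module has been reloaded after the change and that the parameter is exported readable in sysfs:

# Hypothetical /etc/modprobe.d/hfi1.conf containing the recommended options
options hfi1 krcvqs=4 piothreshold=0 sge_copy_mode=2 wss_threshold=70

# After reloading the hfi1 driver, confirm the krcvqs value in effect
cat /sys/module/hfi1/parameters/krcvqs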

See also:

irqbalance

The purpose of irqbalance is to distribute hardware interrupts across processors on a multiprocessor system in order to increase performance. According to the Intel® Omni-Path Fabric performance tuning guide, setting the irqbalance hint policy to exact can be beneficial to the hfi1 receive and send DMA interrupt algorithms in the driver.

To install the irqbalance package, run the following command:

yum -y install irqbalance

Once installed, edit /etc/sysconfig/irqbalance and add the following line:

IRQBALANCE_ARGS=--hintpolicy=exact

Enable the irqbalance service and restart it to pick up the new configuration (make sure the hfi1 driver is loaded first):

systemctl enable irqbalance.service
systemctl restart irqbalance.service
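
As a quick check, assuming the distribution's irqbalance service reads its arguments from /etc/sysconfig/irqbalance, the running daemon's command line should now include the hint policy:

# Show the arguments the running irqbalance daemon was started with;
# the output should include --hintpolicy=exact
ps -C irqbalance -o args=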