InfiniBand Configuration Howto
This howto covers what is needed to work with the varying levels of functionality in the InfiniBand stack. Like Lustre, the InfiniBand software stack has been undergoing large internal API changes, and the impact of those changes on the different InfiniBand hardware drivers has varied. Several different APIs have appeared in the InfiniBand layer. Previously Lustre supported two of those APIs, called PMR and FMR. In more recent kernel versions and in the external InfiniBand stacks available, the PMR functionality was removed, leaving only FMR and the Fast Registration API. Plans exist to eventually remove FMR support as well, which will leave only the Fast Registration API. In anticipation of the FMR removal, some InfiniBand drivers already support only the Fast Registration API.
Code has been developed and merged into Lustre 2.9 to handle all of the cases just described. That work has been backported to Lustre 2.8 but to date has not been merged. If you are planning on running 2.8.0 and have an InfiniBand fabric that contains different flavors of hardware that behave differently, then for now you will need to download three different patches and apply them to the Lustre 2.8 source tree. They are:
One of the more common cases sites are running into is having different types of Mellanox hardware in the same network. Currently some Mellanox hardware requires the mlx4 kernel driver, while newer hardware depends on the mlx5 kernel driver. The two drivers have very different capabilities, and making them communicate at the Lustre level requires some tuning. The first difference you will notice with the mlx5 kernel driver is that the largest value the peer credits for the `ko2iblnd` driver can be set to is 16. This is because the queue pair depth is smaller for the mlx5 kernel driver than for the mlx4 kernel driver. Originally, Lustre required the peer credits to have the same value on all pieces of InfiniBand hardware. Today this is not the case. Unfortunately, even with Lustre supporting different peer credits, an attempt by mlx4 hardware to communicate with mlx5 hardware will still fail.

The resolution to this problem was to determine what, besides peer credits, influences the queue pair depth. That turned out to be the map on demand feature of the `ko2iblnd` driver. By default the map on demand feature is turned off, but it can easily be enabled. In my testing, just setting it to 256, which is the default if enabled, allowed my mlx4 hardware to communicate with the mlx5 based devices. The work done for LU-7101 also made it possible to configure this setup with DLC, in addition to the traditional setting in the lnet module configuration file. Included here is an example of a DLC configuration needed for the mlx5 hardware. The section of interest is the LND tunables section.
    - net: lo
      nid: [email protected]
      status: up
      tunables:
          peer_timeout: 0
          peer_credits: 0
          peer_buffer_credits: 0
          credits: 0
    - net: o2ib1
      nid: [email protected]
      status: up
      interfaces:
          0: ib0
      tunables:
          peer_timeout: 100
          peer_credits: 16
          peer_buffer_credits: 0
          credits: 2560
          CPT: "[0,0]"
      LND tunables:
          peercredits_hiw: 31
          map_on_demand: 256
          concurrent_sends: 63
          fmr_pool_size: 1280
          fmr_flush_trigger: 1024
          fmr_cache: 1
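For sites that still use the traditional module configuration file instead of DLC, the two settings discussed above could be expressed roughly as follows. This is only a sketch: the file name is an assumption (any file under the modprobe configuration directory of your distribution should do), and it carries over only the `peer_credits` and `map_on_demand` values from the DLC example for the mlx5 node.

```conf
# Hypothetical file name, assumed for illustration:
#   /etc/modprobe.d/ko2iblnd.conf
#
# peer_credits=16    largest value usable with the mlx5 kernel driver
# map_on_demand=256  enables map on demand (256 is the default when enabled)
options ko2iblnd peer_credits=16 map_on_demand=256
```

Module options are read when the `ko2iblnd` module is loaded, so a change here normally takes effect only after the module has been unloaded and loaded again. The remaining values shown in the DLC example can be added to the same `options` line if your site needs them.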
Once set up you should see success when bringing up your LNet interfaces.