Infiniband Configuration Howto
Latest revision as of 15:51, 13 November 2020
Introduction
This howto covers what is needed to work with the varying levels of functionality in the InfiniBand stack. Like Lustre, the InfiniBand software stack has been undergoing large internal API changes, and the impact of those changes on the different InfiniBand hardware drivers has varied. Several different APIs have appeared in the InfiniBand layer. Previously Lustre supported two of those APIs, called PMR and FMR. In more recent kernel versions and in the available external InfiniBand stacks, the PMR functionality was removed, leaving only FMR and the Fast Registration API. There are further plans to eventually remove FMR support, which will leave only the Fast Registration API. In anticipation of FMR removal, some InfiniBand drivers already support only the Fast Registration API.
Lustre support
Code to handle all the cases just described has been developed and merged into Lustre 2.9. That work has been backported to Lustre 2.8 but to date has not been merged. If you plan to run 2.8.0 and have an InfiniBand fabric containing different flavors of hardware that behave differently, you will for now need to download three patches and apply them to the Lustre 2.8 source tree. They are:
http://review.whamcloud.com/#/c/17606
http://review.whamcloud.com/#/c/18347
http://review.whamcloud.com/#/c/16367
Mellanox Hardware
One of the more common cases sites run into is having different types of Mellanox hardware in the same network. Currently some Mellanox hardware requires the mlx4 kernel driver, while newer hardware depends on the mlx5 kernel driver. The two drivers have very different capabilities, and making them communicate at the Lustre level requires some tuning.

The first difference you will notice with the mlx5 kernel driver is that the largest value the peer credits for the `ko2iblnd` driver can be set to is 16. This is because the queue pair depth is smaller for the mlx5 kernel driver than for the mlx4 kernel driver. Originally, a limitation of Lustre was that the peer credits had to have the same value on all pieces of InfiniBand hardware. Today this is no longer the case. Unfortunately, even with Lustre supporting different peer credits, communication still fails when mlx4 hardware attempts to talk to mlx5 hardware. The resolution was to determine what, besides peer credits, influences the queue pair depth. That turned out to be the map on demand feature of the `ko2iblnd` driver. By default the map on demand feature is turned off, but it can easily be enabled. In my testing, just setting it to 256, which is the default when enabled, allowed my mlx4 hardware to communicate with the mlx5 based devices.

The work done for LU-7101 also made it possible to configure this setup with DLC, in addition to the traditional settings in the lnet module configuration file. Included here is an example of a DLC configuration needed for the mlx5 hardware. The section of interest is the LND tunables section.
net:
    - net: lo
      nid: 0@lo
      status: up
      tunables:
          peer_timeout: 0
          peer_credits: 0
          peer_buffer_credits: 0
          credits: 0
    - net: o2ib1
      nid: 10.0.0.1@o2ib1
      status: up
      interfaces:
          0: ib0
      tunables:
          peer_timeout: 100
          peer_credits: 16
          peer_buffer_credits: 0
          credits: 2560
          CPT: "[0,0]"
      LND tunables:
          peercredits_hiw: 31
          map_on_demand: 256
          concurrent_sends: 63
          fmr_pool_size: 1280
          fmr_flush_trigger: 1024
          fmr_cache: 1
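For sites that still use the traditional module configuration file rather than DLC, the same values can be expressed as a single `options` line for `ko2iblnd`. The following is a minimal sketch that renders such a line from the values in the example above. The helper name `ko2iblnd_options` is hypothetical, and the assumption that each DLC key maps to a module parameter of the same name (with `peercredits_hiw` becoming `peer_credits_hiw`) should be checked against the output of `modinfo ko2iblnd` on your system.

```python
# Sketch only: renders a modprobe "options" line from the DLC example values.
# ASSUMPTION: module parameter names match the DLC keys, except for the one
# rename listed below. Verify with `modinfo ko2iblnd` before relying on it.
DLC_TO_MODPARAM = {"peercredits_hiw": "peer_credits_hiw"}

# Values copied from the o2ib1 entry in the DLC example.
tunables = {
    "peer_credits": 16,
    "credits": 2560,
    "peer_timeout": 100,
}
lnd_tunables = {
    "peercredits_hiw": 31,
    "map_on_demand": 256,
    "concurrent_sends": 63,
    "fmr_pool_size": 1280,
    "fmr_flush_trigger": 1024,
    "fmr_cache": 1,
}


def ko2iblnd_options(general, lnd):
    """Return an 'options ko2iblnd ...' line (hypothetical helper)."""
    merged = {**general, **lnd}  # insertion order is preserved
    parts = [f"{DLC_TO_MODPARAM.get(key, key)}={value}"
             for key, value in merged.items()]
    return "options ko2iblnd " + " ".join(parts)


if __name__ == "__main__":
    print(ko2iblnd_options(tunables, lnd_tunables))
```

The printed line would go into a file under /etc/modprobe.d/ and takes effect the next time the module is loaded.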
Once set up you should see success when bringing up your LNet interfaces.
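Before bringing the interfaces up, it can help to sanity-check the two settings this howto identifies as mattering for mixed mlx4/mlx5 fabrics. The sketch below encodes only what is stated above: peer credits no larger than 16 for mlx5, and map on demand enabled. The function name `check_o2ib_net` and the dictionary layout are hypothetical, and the check is not a substitute for testing on real hardware.

```python
# Sketch only: flags the two conditions described in this howto.
MLX5_MAX_PEER_CREDITS = 16  # limit stated in this howto for the mlx5 driver


def check_o2ib_net(tunables, lnd_tunables):
    """Return a list of problems found; an empty list means nothing flagged.

    Hypothetical helper: both arguments are plain dicts holding the
    'tunables' and 'LND tunables' sections of one net entry.
    """
    problems = []
    peer_credits = tunables.get("peer_credits", 0)
    if peer_credits > MLX5_MAX_PEER_CREDITS:
        problems.append(
            f"peer_credits={peer_credits} exceeds the mlx5 limit of "
            f"{MLX5_MAX_PEER_CREDITS}"
        )
    # map_on_demand is off by default; a missing or zero value is flagged.
    if lnd_tunables.get("map_on_demand", 0) <= 0:
        problems.append("map_on_demand is disabled; set it (for example to 256)")
    return problems


if __name__ == "__main__":
    # Values from the DLC example: nothing should be flagged.
    print(check_o2ib_net({"peer_credits": 16}, {"map_on_demand": 256}))
    # A configuration with both problems.
    print(check_o2ib_net({"peer_credits": 32}, {"map_on_demand": 0}))
```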