Infiniband Configuration Howto

From Lustre Wiki


Latest revision as of 15:51, 13 November 2020

Introduction

This howto covers what is needed to work with varying levels of functionality in the InfiniBand stack. Like Lustre, the InfiniBand software stack has been undergoing large internal API changes, and the impact of those changes on the different InfiniBand hardware drivers has varied. Several different APIs have appeared in the InfiniBand layer. Previously, Lustre supported two of those APIs, called PMR and FMR. In more recent kernel versions and in the available external InfiniBand stacks, the PMR functionality was removed, leaving only FMR and the Fast Registration API. Further plans exist to eventually remove FMR support as well, which would leave only the Fast Registration API. In anticipation of FMR's removal, some InfiniBand drivers already support only the Fast Registration API.

Lustre support

Code has been developed and merged into Lustre 2.9 to handle all the cases just described. That work has been back-ported to Lustre 2.8 but to date has not been merged. If you plan to run 2.8.0 on an InfiniBand fabric that contains different flavors of hardware that behave differently, then for now you will need to download three patches and apply them to the Lustre 2.8 source tree. They are:


http://review.whamcloud.com/#/c/17606
http://review.whamcloud.com/#/c/18347
http://review.whamcloud.com/#/c/16367
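As a rough sketch, the three changes above can be fetched from Gerrit and applied to a checkout of the Lustre source. The repository path and the "/1" patch-set suffix below follow Gerrit's usual refs/changes naming convention but are assumptions; check each change page for the exact ref before using it.

   cd lustre-release    # your Lustre 2.8 source checkout (directory name is an assumption)
   for change in 17606 18347 16367; do
       # Gerrit refs take the form refs/changes/<last two digits>/<change>/<patch set>
       git fetch http://review.whamcloud.com/lustre-release \
           "refs/changes/${change: -2}/${change}/1" &&
       git cherry-pick FETCH_HEAD
   done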

Mellanox Hardware

One of the more common cases sites run into is having different types of Mellanox hardware in the same network. Currently, some Mellanox hardware requires the mlx4 kernel driver, while newer hardware depends on the mlx5 kernel driver. Each driver has very different capabilities, and making these drivers communicate at the Lustre level requires some tuning. The first difference you will notice with the mlx5 kernel driver is that the largest value the peer credits for the ko2iblnd driver can be set to is 16. This is due to the queue pair depth being smaller for the mlx5 kernel driver than for the mlx4 kernel driver. Originally, a limitation of Lustre was that the peer credits for all pieces of InfiniBand hardware had to have the same value. Today this is not the case. Unfortunately, you will find that even with Lustre's support for different peer credits, attempts by mlx4 hardware to communicate with mlx5 hardware will fail. The resolution to this problem was to determine what, besides peer credits, influences the queue pair depth. That turned out to be the map-on-demand feature of the ko2iblnd driver. By default the map-on-demand feature is turned off, but it can be easily enabled. In my testing, just setting it to 256, which is the default value when enabled, allowed my mlx4 hardware to communicate with the mlx5-based devices. The work done for LU-7101 (https://jira.whamcloud.com/browse/LU-7101) also enabled configuring this setup with DLC in addition to the traditional setting in the LNet module configuration file. Included here is an example of the DLC configuration needed for the mlx5 hardware. The section of interest is the LND tunables section.

net:

   - net: lo
     nid: 0@lo
     status: up
     tunables:
         peer_timeout: 0
         peer_credits: 0
         peer_buffer_credits: 0
         credits: 0
   - net: o2ib1
     nid: 10.0.0.1@o2ib1
     status: up
     interfaces:
         0: ib0
     tunables:
         peer_timeout: 100
         peer_credits: 16
         peer_buffer_credits: 0
         credits: 2560
         CPT: "[0,0]"
     LND tunables:
         peercredits_hiw: 31
         map_on_demand: 256
         concurrent_sends: 63
         fmr_pool_size: 1280
         fmr_flush_trigger: 1024
         fmr_cache: 1
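For reference, the same LND tunables can also be set the traditional way in the LNet module configuration. A minimal sketch, using the values from the DLC configuration above; the /etc/modprobe.d/lustre.conf path is a common convention and may differ on your distribution:

   # /etc/modprobe.d/lustre.conf (path is an assumption, adjust for your distribution)
   options ko2iblnd map_on_demand=256 peer_credits=16 peercredits_hiw=31 \
           concurrent_sends=63 fmr_pool_size=1280 fmr_flush_trigger=1024 fmr_cache=1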

Once this is set up, you should see success when bringing up your LNet interfaces.
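As a sketch of bringing the interface up with DLC, assuming the YAML above has been saved to a file named lnet.conf (the filename is an assumption):

   lnetctl lnet configure     # load LNet and prepare it for configuration
   lnetctl import lnet.conf   # apply the DLC configuration shown above
   lnetctl net show -v        # verify that o2ib1 is up with the expected tunables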