LNet Router Config Guide2

This document provides procedures to configure and tune an LNet router. It also gives detailed instructions for connecting an InfiniBand network to Intel OPA nodes through an LNet router.

LNet
LNet supports different network types such as Ethernet, InfiniBand, Intel Omni-Path, and other proprietary network technologies such as Cray's Gemini. It routes LNet messages between different LNet networks using LNet routing, which provides an efficient protocol for bridging different types of networks. LNet runs in Linux kernel space and allows full RDMA throughput and zero-copy communication when the underlying network supports them. Lustre can initiate a multi-OST read or write using a single Remote Procedure Call (RPC), which allows the client to access data using RDMA at near-peak bandwidth rates.

The Multi-Rail (MR) feature, implemented in Lustre 2.10.X, allows multiple interfaces of the same type on a node to be grouped together under the same LNet (e.g. tcp0, o2ib0). These interfaces can then be used simultaneously to carry LNet traffic. MR can also use multiple interfaces configured on different networks. For example, OPA and MLX interfaces can be grouped under their respective LNets and then used with the MR feature to carry LNet traffic simultaneously.

LNet Configuration Example
An LNet router is a specialized Lustre client on which the Lustre file system is not mounted and only LNet is running. A single LNet router can serve different file systems.

In the example topology (Figure 1):
 * Servers are on LAN1, a Mellanox based InfiniBand network – 10.10.0.0/24
 * Clients are on LAN2, an Intel OPA network – 10.20.0.0/24
 * Routers are on LAN1 at 10.10.0.20 and 10.10.0.21, and on LAN2 at 10.20.0.29 and 10.20.0.30

The network configuration on the nodes can be done either by adding module parameters to lustre.conf (/etc/modprobe.d/lustre.conf) or dynamically with the lnetctl command utility. The current configuration can also be exported to a YAML file and re-applied later by importing that file.

Network Configuration by adding module parameters in lustre.conf
Servers:
    options lnet networks="o2ib1(ib0)" routes="o2ib2 10.10.0.20@o2ib1"

Routers:
    options lnet networks="o2ib1(ib0),o2ib2(ib1)" forwarding="enabled"

Clients:
    options lnet networks="o2ib2(ib0)" routes="o2ib1 10.20.0.29@o2ib2"
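The three option lines above follow one pattern, so they can be generated rather than typed by hand. The following is a minimal Python sketch of that idea; the function name lnet_options and its arguments are hypothetical helpers introduced here for illustration, and writing the forwarding setting as forwarding="enabled" is an assumption about the quoting.

```python
def lnet_options(networks, routes=None, forwarding=False):
    """Build one 'options lnet ...' line for /etc/modprobe.d/lustre.conf.

    networks: list of (lnet_name, interface) pairs, e.g. [("o2ib1", "ib0")]
    routes:   optional list of (dest_lnet, gateway_nid) pairs
    forwarding: True on routers only
    (Hypothetical helper, not part of Lustre.)
    """
    nets = ",".join(f"{net}({iface})" for net, iface in networks)
    parts = [f'options lnet networks="{nets}"']
    if routes:
        # Multiple route definitions are separated by semicolons.
        route_str = "; ".join(f"{dest} {gw}" for dest, gw in routes)
        parts.append(f'routes="{route_str}"')
    if forwarding:
        parts.append('forwarding="enabled"')
    return " ".join(parts)


server = lnet_options([("o2ib1", "ib0")], routes=[("o2ib2", "10.10.0.20@o2ib1")])
router = lnet_options([("o2ib1", "ib0"), ("o2ib2", "ib1")], forwarding=True)
client = lnet_options([("o2ib2", "ib0")], routes=[("o2ib1", "10.20.0.29@o2ib2")])

for line in (server, router, client):
    print(line)
```

Generating the lines from one description of the topology keeps the server, router and client files consistent when an address changes.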

NOTE: LNet must be restarted to apply the new configuration. To do this, unconfigure the LNet network and configure it again. Make sure that the Lustre network and Lustre file system are stopped before unloading the modules.

    # To unload and load the LNet module
    modprobe -r lnet
    modprobe lnet

    # To unconfigure and reconfigure LNet
    lnetctl lnet unconfigure
    lnetctl lnet configure

Dynamic Network Configuration using lnetctl command
Servers:
    lnetctl net add --net o2ib1 --if ib0
    lnetctl route add --net o2ib2 --gateway 10.10.0.20@o2ib1
    lnetctl peer add --nid 10.10.0.20@o2ib1

Routers:
    lnetctl net add --net o2ib1 --if ib0
    lnetctl net add --net o2ib2 --if ib1
    lnetctl peer add --nid 10.10.0.1@o2ib1
    lnetctl peer add --nid 10.20.0.1@o2ib2
    lnetctl set routing 1

Clients:
    lnetctl net add --net o2ib2 --if ib0
    lnetctl route add --net o2ib1 --gateway 10.20.0.29@o2ib2
    lnetctl peer add --nid 10.20.0.29@o2ib2

Importing/Exporting configuration using a YAML file format
    # To export the current configuration to a YAML file
    lnetctl export FILE.yaml
    lnetctl export > FILE.yaml

    # To import the configuration from a YAML file
    lnetctl import FILE.yaml
    lnetctl import < FILE.yaml

There is a default lnet.conf file installed at /etc/lnet.conf which has an example configuration in YAML format. Another example of a configuration in a YAML file is:

    net:
        - net type: o2ib1
          local NI(s):
            - nid: 10.10.0.1@o2ib1
              status: up
              interfaces:
                  0: ib0
              tunables:
                  peer_timeout: 180
                  peer_credits: 8
                  peer_buffer_credits: 0
                  credits: 256
              lnd tunables:
                  peercredits_hiw: 64
                  map_on_demand: 32
                  concurrent_sends: 256
                  fmr_pool_size: 2048
                  fmr_flush_trigger: 512
                  fmr_cache: 1
              tcp bonding: 0
              dev cpt: -1
              CPT: "[0]"
    route:
        - net: o2ib2
          gateway: 10.10.0.20@o2ib1
          hop: 1
          priority: 0
          state: up
    peer:
        - primary nid: 10.10.0.20@o2ib1
          Multi-Rail: False
          peer ni:
            - nid: 10.10.0.20@o2ib1
              state: up
              max_ni_tx_credits: 8
              available_tx_credits: 8
              min_tx_credits: 7
              tx_q_num_of_buf: 0
              available_rtr_credits: 8
              min_rtr_credits: 8
              refcount: 4
    global:
        numa_range: 0
        max_intf: 200
        discovery: 1

LNet provides a mechanism to monitor each route entry. LNet pings each gateway identified in a route entry at a regular, configurable interval (live_router_check_interval) to ensure that it is alive. If sending over a specific route fails, or if the router pinger determines that the gateway is down, the route is marked as down and is not used. The gateway is subsequently pinged at a regular, configurable interval (dead_router_check_interval) to determine when it becomes alive again.
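The monitoring behaviour described above can be pictured as a small state machine per route. The Python sketch below is only an illustration of that description, not the LNet implementation; the Route class and the function names are hypothetical, and the 60-second default intervals are taken from the parameter descriptions later in this guide.

```python
from dataclasses import dataclass


@dataclass
class Route:
    """One route entry as seen by the router pinger (illustrative only)."""
    gateway: str
    up: bool = True


def check_interval(route, live_router_check_interval=60,
                   dead_router_check_interval=60):
    """Seconds to wait before the next ping of this route's gateway."""
    if route.up:
        return live_router_check_interval
    return dead_router_check_interval


def record_ping(route, replied):
    """Mark the route up or down according to the latest ping result."""
    route.up = replied
    return route.up


def usable_routes(routes):
    """Routes marked down are not used for sending."""
    return [r for r in routes if r.up]


routes = [Route("10.10.0.20@o2ib1"), Route("10.10.0.21@o2ib1")]
record_ping(routes[0], replied=False)      # gateway stopped answering
print([r.gateway for r in usable_routes(routes)])
record_ping(routes[0], replied=True)       # gateway answers again
print([r.gateway for r in usable_routes(routes)])
```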

Multi-Rail LNet Configuration Example
If the routers are MR-enabled, they can be added to the clients and the servers as peers with multiple interfaces. The MR algorithm then ensures that both interfaces of a router are used when sending traffic to it. However, a single interface failure will still cause the entire router to go down. With the network topology example in Figure 1 above, LNet MR can be configured as follows:

Servers:
    lnetctl net add --net o2ib1 --if ib0,ib1
    lnetctl route add --net o2ib2 --gateway 10.10.0.20@o2ib1
    lnetctl peer add --nid 10.10.0.20@o2ib1,10.10.0.21@o2ib1

Routers:
    lnetctl net add --net o2ib1 --if ib0,ib1
    lnetctl net add --net o2ib2 --if ib2,ib3
    lnetctl peer add --nid 10.10.0.1@o2ib1,10.10.0.2@o2ib1
    lnetctl peer add --nid 10.20.0.1@o2ib2,10.20.0.2@o2ib2
    lnetctl set routing 1

Clients:
    lnetctl net add --net o2ib2 --if ib0,ib1
    lnetctl route add --net o2ib1 --gateway 10.20.0.29@o2ib2
    lnetctl peer add --nid 10.20.0.29@o2ib2,10.20.0.30@o2ib2
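When the same Multi-Rail setup has to be applied to many nodes, the lnetctl invocations can be assembled programmatically. The Python sketch below only builds the argument lists for the server commands shown above and prints them; it does not run anything. The helper names net_add, route_add, peer_add and server_commands are hypothetical, while the lnetctl subcommands and options are the ones used in this guide.

```python
import shlex


def net_add(net, interfaces):
    """argv for: lnetctl net add --net <net> --if <if1,if2,...>"""
    return ["lnetctl", "net", "add", "--net", net, "--if", ",".join(interfaces)]


def route_add(dest_net, gateway_nid):
    """argv for: lnetctl route add --net <dest_net> --gateway <nid>"""
    return ["lnetctl", "route", "add", "--net", dest_net, "--gateway", gateway_nid]


def peer_add(nids):
    """argv for: lnetctl peer add --nid <nid1,nid2,...>"""
    return ["lnetctl", "peer", "add", "--nid", ",".join(nids)]


def server_commands():
    """Commands for a server in the Multi-Rail example topology."""
    return [
        net_add("o2ib1", ["ib0", "ib1"]),
        route_add("o2ib2", "10.10.0.20@o2ib1"),
        peer_add(["10.10.0.20@o2ib1", "10.10.0.21@o2ib1"]),
    ]


for argv in server_commands():
    print(shlex.join(argv))
```

On a node where lnetctl is installed, each argument list could be passed to subprocess.run(argv, check=True) instead of being printed.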

Fine-Grained Routing
The routes parameter identifies the LNet routers in a Lustre configuration and tells a node which route to use when forwarding traffic. It takes a semicolon-separated list of router definitions.

    routes=dest_lnet [hop] [priority] router_NID@src_lnet; \
           dest_lnet [hop] [priority] router_NID@src_lnet

An alternative syntax consists of a colon-separated list of router definitions:

    routes=dest_lnet: [hop] [priority] router_NID@src_lnet \
                      [hop] [priority] router_NID@src_lnet

When there are two or more LNet routers, it is possible to give weighted priorities to each router using the priority parameter. Here are some possible reasons for using this parameter:
 * One of the routers is more capable than the other.
 * One router is a primary router and the other is a back-up.
 * One router is for one section of clients and the other is for another section.
 * Each router is moving traffic to a different physical location.

The priority parameter is optional and need not be specified if no priority exists.

The hop parameter specifies the number of hops to the destination. When a node forwards traffic, the route with the least number of hops is used. If multiple routes to the same destination network have the same number of hops, the traffic is distributed between these routes in a round-robin fashion.

To reach/transmit to the LNet dest_lnet, the next hop for a given node is the LNet router with the NID router_NID in the LNet src_lnet.

Given a sufficiently well-architected system, it is possible to map the flow to and from every client or server. This type of routing has also been called fine-grained routing.
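The hop rule described in this section (use the route with the fewest hops, and rotate round-robin among routes that tie) can be sketched in Python as follows. This is an illustration of the rule as stated in this guide, not the kernel's selection code. The parser follows the semicolon-separated syntax as written above; the default values hop=1 and priority=0, the helper names, and the third gateway address 10.20.0.31 are assumptions made for the example. The priority value is parsed but not used for selection, because the text does not specify how priorities are ordered.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Route:
    dest: str        # destination LNet, e.g. "o2ib1"
    gateway: str     # router NID, e.g. "10.20.0.29@o2ib2"
    hop: int = 1     # assumed default when omitted
    priority: int = 0  # assumed default when omitted


def parse_routes(spec):
    """Parse 'dest [hop] [priority] gateway; dest [hop] [priority] gateway'."""
    routes = []
    for entry in spec.split(";"):
        tokens = entry.split()
        if not tokens:
            continue
        if not 2 <= len(tokens) <= 4:
            raise ValueError(f"cannot parse route definition: {entry!r}")
        dest, gateway = tokens[0], tokens[-1]
        numbers = [int(t) for t in tokens[1:-1]]
        hop = numbers[0] if len(numbers) >= 1 else 1
        priority = numbers[1] if len(numbers) == 2 else 0
        routes.append(Route(dest, gateway, hop, priority))
    return routes


def gateway_picker(routes, dest):
    """Return an endless iterator over the best gateways for dest.

    Only routes with the fewest hops are kept; ties are served round-robin.
    """
    candidates = [r for r in routes if r.dest == dest]
    if not candidates:
        raise LookupError(f"no route to {dest}")
    fewest = min(r.hop for r in candidates)
    return cycle([r.gateway for r in candidates if r.hop == fewest])


spec = ("o2ib1 1 10.20.0.29@o2ib2; "
        "o2ib1 1 10.20.0.30@o2ib2; "
        "o2ib1 2 10.20.0.31@o2ib2")   # the last gateway is hypothetical
picker = gateway_picker(parse_routes(spec), "o2ib1")
print([next(picker) for _ in range(4)])
```

The two one-hop gateways alternate, and the two-hop route is never chosen while a shorter one exists.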

Advanced Routing Parameters
In a Lustre configuration where different types of LNet networks are connected by routers, several kernel module parameters can be set to monitor and improve routing performance. These parameters are set in the /etc/modprobe.d/lustre.conf file. The routing-related parameters are:
 * auto_down - Enable/disable (1/0) the automatic marking of router state as up or down. The default value is 1. To disable router marking, set:
       options lnet auto_down=0
 * avoid_asym_router_failure - Specifies that if even one interface of a router is down for some reason, the entire router is marked as down. This is important because if nodes are not aware that the interface on one side is down, they will keep pushing data to the other side, presuming that the router is healthy when it really is not. To turn it on:
       options lnet avoid_asym_router_failure=1
 * live_router_check_interval - Specifies a time interval in seconds after which the router checker will ping the live routers. The default value is 60. To set the value to 50, use:
       options lnet live_router_check_interval=50
 * dead_router_check_interval - Specifies a time interval in seconds after which the router checker will check the dead routers. The default value is 60. To set the value to 50:
       options lnet dead_router_check_interval=50
 * router_ping_timeout - Specifies a timeout for the router checker when it checks live or dead routers. The router checker sends a ping message to each dead or live router once every dead_router_check_interval or live_router_check_interval respectively. The default value is 50. To set the value to 60:
       options lnet router_ping_timeout=60
 * check_routers_before_use - Specifies that routers are to be checked before use. Set to off by default. If this parameter is set to on, the dead_router_check_interval parameter must be given a positive integer value. To turn it on:
       options lnet check_routers_before_use=1
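The one dependency between these parameters (check_routers_before_use needs a positive dead_router_check_interval) is easy to miss when editing lustre.conf by hand. The Python sketch below renders the option lines from a dictionary and checks that constraint first. The function name render_routing_options is hypothetical, and falling back to 60 when dead_router_check_interval is not given relies on the default value stated above.

```python
def render_routing_options(settings):
    """Return 'options lnet name=value' lines after a basic sanity check.

    settings: dict mapping routing parameter names to integer values.
    (Hypothetical helper for illustration.)
    """
    check_first = settings.get("check_routers_before_use", 0)
    dead_interval = settings.get("dead_router_check_interval", 60)
    if check_first and dead_interval <= 0:
        raise ValueError(
            "check_routers_before_use=1 requires a positive "
            "dead_router_check_interval"
        )
    return [f"options lnet {name}={value}" for name, value in settings.items()]


example = {
    "auto_down": 0,
    "avoid_asym_router_failure": 1,
    "live_router_check_interval": 50,
    "dead_router_check_interval": 50,
    "router_ping_timeout": 60,
    "check_routers_before_use": 1,
}
lines = render_routing_options(example)
print("\n".join(lines))
```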

The router_checker obtains the following information from each router:
 * time the router was disabled
 * elapsed disable time

If the router_checker does not get a reply message from the router within router_ping_timeout seconds, it considers the router to be down. If a router that is marked “up” responds to a ping, the timeout is reset.

If 100 packets have been sent successfully through a router, the sent-packets counter for that router will have a value of 100. The statistics data of an LNet router can be found in /proc/sys/lnet/stats. If no interval is specified, then statistics are sampled and printed only one time. Otherwise, statistics are sampled and printed at the specified interval (in seconds). These statistics can also be displayed using the lnetctl utility:

    lnetctl stats show
    statistics:
        msgs_alloc: 0
        msgs_max: 2
        errors: 0
        send_count: 887
        recv_count: 887
        route_count: 0
        drop_count: 0
        send_length: 656
        recv_length: 70048
        route_length: 0
        drop_length: 0
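For monitoring scripts, the counters printed by lnetctl stats show can be turned into a dictionary with a few lines of Python. The sketch below parses a copy of the sample output from this section; the parse_stats name is hypothetical, and the parser assumes the simple "name: integer" layout shown here rather than being a general YAML reader.

```python
SAMPLE = """\
statistics:
    msgs_alloc: 0
    msgs_max: 2
    errors: 0
    send_count: 887
    recv_count: 887
    route_count: 0
    drop_count: 0
    send_length: 656
    recv_length: 70048
    route_length: 0
    drop_length: 0
"""


def parse_stats(text):
    """Map each 'name: value' line to an integer counter.

    Lines without a value (such as the 'statistics:' header) are skipped.
    """
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value:
            counters[key.strip()] = int(value)
    return counters


stats = parse_stats(SAMPLE)
print(stats["send_count"], stats["drop_count"])
```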

The ping response also provides the status of the NIDs of the node being pinged. In this way, the pinging node knows whether to keep using this node as a next hop. If one of the NIDs of the router is down and avoid_asym_router_failure is set, that router is no longer used.
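The decision described above can be written as a small predicate. The Python sketch below is an illustration of the rule as stated in this guide, not the LNet source; the function name router_usable and the representation of the ping response as a NID-to-status dictionary are assumptions.

```python
def router_usable(ping_replied, nid_status, avoid_asym_router_failure=True):
    """Decide whether a router can still be used as a next hop.

    ping_replied: False if no reply arrived within router_ping_timeout
    nid_status:   dict mapping each router NID to "up" or "down",
                  as reported in the ping response (assumed layout)
    """
    if not ping_replied:
        return False
    if avoid_asym_router_failure:
        # One down interface is enough to stop using the whole router.
        return all(status == "up" for status in nid_status.values())
    return True


response = {"10.10.0.20@o2ib1": "up", "10.20.0.29@o2ib2": "down"}
print(router_usable(True, response))                                   # False
print(router_usable(True, response, avoid_asym_router_failure=False))  # True
```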

LNet Dynamic Configuration
LNet can be configured dynamically using the lnetctl utility. The lnetctl utility can be used to initialize LNet without bringing up any network interfaces. This gives the user the flexibility to add interfaces after LNet has been loaded. In general, the lnetctl format is as follows:

    lnetctl cmd subcmd [options]

The following configuration items are managed by the tool:
 * Configuring/Unconfiguring LNet
 * Adding/Removing/Showing Networks
 * Adding/Removing/Showing Peers
 * Adding/Removing/Showing Routes
 * Enabling/Disabling routing
 * Configuring Router Buffer Pools

Configuring/Unconfiguring LNet
After LNet has been loaded via modprobe (modprobe lnet), the lnetctl utility can be used to configure LNet without bringing up the networks that are specified in the module parameters. It can also configure the network interfaces specified in the module parameters when given the --all option.

    # To configure LNet
    lnetctl lnet configure [--all]

    # To unconfigure LNet
    lnetctl lnet unconfigure

Adding/Removing/Showing Networks
LNet is now ready to have networks added. To add an o2ib1 LNet network on the ib0 and ib1 interfaces:

    lnetctl net add --net o2ib1 --if ib0,ib1

Using the show subcommands, it is possible to review the configuration:

    lnetctl net show -v
    net:
        - net type: lo
          local NI(s):
            - nid: 0@lo
              status: up
              statistics:
                  send_count: 0
                  recv_count: 0
                  drop_count: 0
              tunables:
                  peer_timeout: 0
                  peer_credits: 0
                  peer_buffer_credits: 0
                  credits: 0
              lnd tunables:
              tcp bonding: 0
              dev cpt: 0
              CPT: "[0]"
        - net type: o2ib1
          local NI(s):
            - nid: 192.168.5.151@o2ib1
              status: up
              interfaces:
                  0: ib0
              statistics:
                  send_count: 0
                  recv_count: 0
                  drop_count: 0
              tunables:
                  peer_timeout: 180
                  peer_credits: 8
                  peer_buffer_credits: 0
                  credits: 256
              lnd tunables:
                  peercredits_hiw: 64
                  map_on_demand: 32
                  concurrent_sends: 256
                  fmr_pool_size: 2048
                  fmr_flush_trigger: 512
                  fmr_cache: 1
              tcp bonding: 0
              dev cpt: -1
              CPT: "[0]"
            - nid: 192.168.5.152@o2ib1
              status: up
              interfaces:
                  0: ib1
              statistics:
                  send_count: 0
                  recv_count: 0
                  drop_count: 0
              tunables:
                  peer_timeout: 180
                  peer_credits: 8
                  peer_buffer_credits: 0
                  credits: 256
              lnd tunables:
                  peercredits_hiw: 64
                  map_on_demand: 32
                  concurrent_sends: 256
                  fmr_pool_size: 2048
                  fmr_flush_trigger: 512
                  fmr_cache: 1
              tcp bonding: 0
              dev cpt: -1
              CPT: "[0]"

To delete an interface from a net (for example, ib1 from o2ib1):

    lnetctl net del --net o2ib1 --if ib1

The resultant configuration would look like this:

    lnetctl net show -v
    net:
        - net type: lo
          local NI(s):
            - nid: 0@lo
              status: up
              statistics:
                  send_count: 0
                  recv_count: 0
                  drop_count: 0
              tunables:
                  peer_timeout: 0
                  peer_credits: 0
                  peer_buffer_credits: 0
                  credits: 0
              lnd tunables:
              tcp bonding: 0
              dev cpt: 0
              CPT: "[0]"
        - net type: o2ib1
          local NI(s):
            - nid: 192.168.5.151@o2ib1
              status: up
              interfaces:
                  0: ib0
              statistics:
                  send_count: 0
                  recv_count: 0
                  drop_count: 0
              tunables:
                  peer_timeout: 180
                  peer_credits: 8
                  peer_buffer_credits: 0
                  credits: 256
              lnd tunables:
                  peercredits_hiw: 64
                  map_on_demand: 32
                  concurrent_sends: 256
                  fmr_pool_size: 2048
                  fmr_flush_trigger: 512
                  fmr_cache: 1
              tcp bonding: 0
              dev cpt: -1
              CPT: "[0]"

Adding/Removing/Showing Peers
Use lnetctl to add remote peers:

    lnetctl peer add --prim_nid 192.168.5.161@o2ib1 --nid 192.168.5.162@o2ib1

Verify the added peer configuration:

    lnetctl peer show -v
    peer:
        - primary nid: 192.168.5.161@o2ib1
          Multi-Rail: True
          peer ni:
            - nid: 192.168.5.161@o2ib1
              state: NA
              max_ni_tx_credits: 8
              available_tx_credits: 8
              min_tx_credits: 7
              tx_q_num_of_buf: 0
              available_rtr_credits: 8
              min_rtr_credits: 8
              refcount: 1
              statistics:
                  send_count: 2
                  recv_count: 2
                  drop_count: 0
            - nid: 192.168.5.162@o2ib1
              state: NA
              max_ni_tx_credits: 8
              available_tx_credits: 8
              min_tx_credits: 7
              tx_q_num_of_buf: 0
              available_rtr_credits: 8
              min_rtr_credits: 8
              refcount: 1
              statistics:
                  send_count: 1
                  recv_count: 1
                  drop_count: 0

To delete a peer:

    lnetctl peer del --prim_nid 192.168.5.161@o2ib1 --nid 192.168.5.162@o2ib1