Starting and Stopping LNet

From Lustre Wiki

Latest revision as of 22:47, 30 August 2017

LNet, like most of the services that comprise a Lustre file system, runs in the Linux kernel and is incorporated as a kernel module. LNet is started in two steps:

  1. Load the kernel modules
  2. Start the services

The lnet kernel module can be loaded directly through the modprobe command or indirectly by loading a kernel module that has a dependency on LNet. In normal operation, the lnet module will be loaded indirectly as a consequence of attempting to start a Lustre service, e.g. by mounting a file system on a client. However, one can treat LNet as independent of Lustre and start it on its own. This is useful for testing and debugging purposes, and to provide some verification of correctness when a system boots up prior to committing to loading the higher-level services (i.e. Lustre).

To load the LNet kernel module, run:

modprobe [-v] lnet

The -v flag is optional and provides verbose output, which is useful for debugging purposes but is normally omitted. For example:

[root@rh7z-pe ~]# modprobe -v lnet
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/net/lustre/libcfs.ko 
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/net/lustre/lnet.ko networks="tcp0(eth1)" 

Notice that a second module, called libcfs.ko, was also loaded. The libcfs module is an API used throughout Lustre and LNet and provides primitives for things like process management, memory management, and debugging.
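A boot-time or test script may want to load the module only when it is not already present. The following is a minimal sketch of that check, assuming a POSIX shell with lsmod and modprobe on the PATH. The function names module_loaded and load_lnet are hypothetical and are not part of Lustre.

```shell
# Hypothetical helpers for illustration; not shipped with Lustre.

# Return 0 if the named kernel module appears in the first column of lsmod.
module_loaded() {
    lsmod | awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Load lnet only when it is not already loaded.
load_lnet() {
    if module_loaded lnet; then
        echo "lnet already loaded"
    else
        modprobe lnet && echo "lnet loaded"
    fi
}
```

Calling load_lnet as root either reports that the module is already present or loads it, together with libcfs as a dependency.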

After the module is loaded, the LNet service needs to be started:

lctl network up
# or
lctl network configure

The lctl network command works on all versions of Lustre, and prior to version 2.7, it is the only way to manually start LNet. In Lustre 2.7 and onward, there is also the lnetctl utility:

lnetctl lnet configure [--all]

The lnetctl lnet configure command does not automatically configure networks that are specified in the kernel module parameters: the lnet service will start, but the interfaces will not be configured. Supplying the --all flag causes all of the networks defined as kernel module options to be loaded and started.
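Because the available tool depends on the Lustre version, a start-up script can choose between the two at run time. This is a sketch only, assuming that the presence of lnetctl on the PATH is a good enough indicator of Lustre 2.7 or later. The function name start_lnet is hypothetical.

```shell
# Hypothetical helper: start LNet with whichever utility is available.
start_lnet() {
    if command -v lnetctl >/dev/null 2>&1; then
        # Lustre 2.7+: --all also configures the networks that are
        # defined in the kernel module options.
        lnetctl lnet configure --all
    else
        # Fallback that works on all versions of Lustre.
        lctl network up
    fi
}
```

The sketch always passes --all so that both branches bring up the networks defined in the module options. Omit the flag if the networks are to be added dynamically afterwards.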

To view the loaded configuration:

lctl list_nids

# or for dynamic lnet in Lustre 2.7+
lnetctl net show [--verbose]
lnetctl export

The lnetctl export command is equivalent to lnetctl net show --verbose.
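When LNet is started from a script, it can be useful to confirm that at least one NID is actually configured before continuing. A minimal sketch, assuming that lctl list_nids prints one NID per line and prints nothing when no network is up. The function name lnet_has_nid is hypothetical.

```shell
# Hypothetical check: succeed only if lctl reports at least one NID.
lnet_has_nid() {
    # Fail if lctl itself fails, for example when lnet is not loaded.
    nids=$(lctl list_nids 2>/dev/null) || return 1
    # Succeed only when the listing is non-empty.
    [ -n "$nids" ]
}
```

A script might use it as: lnet_has_nid || echo "LNet has no configured NIDs" >&2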

To shut down LNet and unload the kernel modules, first stop the LNet networks on the host:

lctl network down
# or 
lctl network unconfigure

Then use the lustre_rmmod command to unload the kernel modules:

lustre_rmmod

Alternatively, the modules can be unloaded with rmmod directly, in reverse dependency order:

rmmod lnet
rmmod libcfs

lustre_rmmod is the recommended method for unloading Lustre and LNet kernel modules, because it checks for dependencies and eliminates any guesswork on the part of the systems administrator in identifying all of the modules to unload and the correct sequence for doing so.
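The two shutdown steps can be combined so that the modules are only unloaded after the networks have been stopped. This is a sketch under the assumption that lctl network down returns a non-zero exit status on failure. The function name stop_lnet is hypothetical.

```shell
# Hypothetical helper: stop the LNet networks, then unload the modules.
stop_lnet() {
    # Do not attempt to unload modules if the networks could not be stopped.
    lctl network down || return 1
    lustre_rmmod
}
```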

LNet can also be loaded indirectly, as a dependency of the lustre kernel module. If LNet is loaded in this way, its start-up behavior is different because the LNet networks defined in kernel module options will be automatically configured and brought online. This is easily illustrated just by loading the Lustre module:

[root@rh7z-pe ~]# modprobe -v lustre
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/net/lustre/libcfs.ko 
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/net/lustre/lnet.ko networks="tcp0(eth1)" 
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/fs/lustre/obdclass.ko 
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/fs/lustre/ptlrpc.ko 
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/fs/lustre/fld.ko 
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/fs/lustre/fid.ko 
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/fs/lustre/lov.ko 
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/fs/lustre/mdc.ko 
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/fs/lustre/lmv.ko 
insmod /lib/modules/3.10.0-327.13.1.el7_lustre.x86_64/extra/kernel/fs/lustre/lustre.ko

Notice that the lnet.ko module is loaded as a dependency. The console and kernel ring buffer output will look something like this:

[266699.213610] LNet: HW CPU cores: 2, npartitions: 1
[266699.232630] alg: No test for adler32 (adler32-zlib)
[266699.234184] alg: No test for crc32 (crc32-table)
[266707.286906] Lustre: Lustre: Build Version: jenkins-arch=x86_64,build_type=server,distro=el7,ib_stack=inkernel-40--PRISTINE-3.10.0-327.13.1.el7_lustre.x86_64
[266707.338890] LNet: Added LNI 192.168.207.2@tcp [8/256/0/180]
[266707.339851] LNet: Accept secure, port 988

As the output above shows, the LNet network was configured automatically: the NID 192.168.207.2@tcp was added without any explicit lctl or lnetctl command.
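The "Added LNI" messages can also be picked out of the kernel log programmatically. The filter below is a small sketch that assumes the message format shown in the example output. The format may differ between Lustre versions, and the function name added_nids is hypothetical.

```shell
# Hypothetical filter: read kernel log lines on stdin and print the NID
# from each "LNet: Added LNI <nid> [...]" message.
added_nids() {
    sed -n 's/.*LNet: Added LNI \([^ ]*\).*/\1/p'
}
```

For example: dmesg | added_nids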

The lustre_rmmod behavior is also different in this circumstance, compared to loading LNet on its own. If the administrator loads and configures LNet on its own, independently of the Lustre module, then it is necessary to unconfigure the LNet networks before removing the kernel modules:

[root@rh7z-pe ~]# modprobe lnet
[root@rh7z-pe ~]# lctl network up
LNET configured
[root@rh7z-pe ~]# lctl list_nids
192.168.207.2@tcp
[root@rh7z-pe ~]# lustre_rmmod
Modules still loaded: 
lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
[root@rh7z-pe ~]# lctl network down
LNET ready to unload
[root@rh7z-pe ~]# lustre_rmmod
[root@rh7z-pe ~]# lsmod |grep lnet

However, if the lnet module is loaded indirectly, as a dependency of the Lustre kernel module, then lustre_rmmod will gracefully unload all modules including lnet:

[root@rh7z-pe ~]# modprobe lustre
[root@rh7z-pe ~]# lctl list_nids
192.168.207.2@tcp
[root@rh7z-pe ~]# lustre_rmmod
[root@rh7z-pe ~]# lsmod | grep -E lnet\|lustre

This behavior is consistent, but not entirely intuitive. It stems from a special function of LNet: routing. LNet routing enables a node that is connected to more than one LNet fabric to route traffic between the networks. LNet routing is a complex topic and is not discussed in this article. For more information on LNet routers, see:

http://www.intel.com/content/www/us/en/software/configuring-lnet-routers-file-systems-lustre-guide.html

Because routing is a function of the network, not of the Lustre file system itself, lustre_rmmod will effectively assume that if a host has only the lnet module loaded and running, then it is providing routing services. lustre_rmmod will therefore refuse to unload the modules unless the lnet service is explicitly unconfigured.

If, on the other hand, the lustre kernel module is also loaded and there are no file systems mounted, then lustre_rmmod will assume that the host is an idle server or client and will unload the entire stack, including the lnet modules.
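An administrator can make the same "is anything mounted?" check before calling lustre_rmmod. The sketch below assumes that mounted Lustre clients and targets appear with file system type lustre in a mounts table in /proc/mounts format. The function name lustre_mounted and its optional file argument are hypothetical.

```shell
# Hypothetical check: succeed if any entry in the mounts table has file
# system type "lustre". The table defaults to /proc/mounts; another file
# in the same format can be passed as the first argument.
lustre_mounted() {
    awk '$3 == "lustre" { found = 1 } END { exit !found }' "${1:-/proc/mounts}"
}
```

For example: lustre_mounted || lustre_rmmod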

If a Lustre OSD is mounted on a host, then the lustre_rmmod command will not unload the Lustre kernel modules and will report an error:

[root@rh7z-mds1 ~]# df -ht lustre
File system      Size  Used Avail Use% Mounted on
mgspool/mgt     960M  2.2M  956M   1% /lfs/mgt

[root@rh7z-mds1 ~]# lustre_rmmod
  0 UP osd-zfs MGS-osd MGS-osd_UUID 5
  1 UP mgs MGS MGS 7
  2 UP mgc MGC192.168.227.11@tcp1 c2108a9c-a62f-6626-48e4-68f1caf1bce3 5
Modules still loaded: 
lustre/mgs/mgs.o lustre/mgc/mgc.o lustre/quota/lquota.o lustre/fid/fid.o lustre/fld/fld.o lnet/klnds/socklnd/ksocklnd.o lustre/ptlrpc/ptlrpc.o lustre/obdclass/obdclass.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o

[root@rh7z-mds1 ~]# df -ht lustre
File system      Size  Used Avail Use% Mounted on
mgspool/mgt     960M  2.2M  956M   1% /lfs/mgt

From this example, it can be seen that because the MGS is mounted, lustre_rmmod takes no action to remove the kernel modules. Instead, it shows that there are active services running on the host and exits. The MGT is still mounted and the MGS is running. The lustre_rmmod command is a very useful tool for ensuring the correct and safe unloading of Lustre kernel modules.
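In a script, the refusal can be detected from the "Modules still loaded:" message that lustre_rmmod prints in the examples above. The wrapper below is a sketch that relies only on that message and makes no assumption about the exit status of lustre_rmmod. The function name safe_rmmod is hypothetical.

```shell
# Hypothetical wrapper: run lustre_rmmod and return non-zero if its output
# reports that modules are still loaded.
safe_rmmod() {
    out=$(lustre_rmmod 2>&1)
    case "$out" in
        *"Modules still loaded:"*)
            # Pass the report through on stderr for the administrator.
            printf '%s\n' "$out" >&2
            return 1
            ;;
    esac
    return 0
}
```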