CHAPTER 22

Lustre Monitoring

This chapter describes the options for monitoring Lustre: the Lustre Monitoring Tool, Red Hat Cluster Manager, SNMP monitoring, CollectL, and other monitoring options.


22.1 Lustre Monitoring Tool

The Lustre Monitoring Tool (LMT[1]) is a Python-based, distributed system that provides a top-like display of activity on server-side nodes[2] (MDS, OSS and portals routers) in one or more Lustre file systems. For more information on LMT, including the setup procedure, see:

http://code.google.com/p/lmt/

LMT questions can be directed to:

lmt-discuss@googlegroups.com


22.2 Red Hat Cluster Manager

The Red Hat Cluster Manager provides high availability features that are essential for data integrity, application availability and uninterrupted service under various failure conditions. You can use the Cluster Manager to test MDS/OST failure in Lustre clusters.

To use Cluster Manager to test MDS failover, specific hardware is required: a compute node, OSTs and two machines (to act as the active and failover MDSs). The MDS nodes must be able to see the same shared storage, so you need to prepare a shared disk for the Cluster Manager and the MDSs. Several RPM packages are also required[3], along with certain configuration changes.

For more information on the Cluster Manager (bundled in the Red Hat Cluster Suite), see the Red Hat Cluster Suite. Supporting documentation is available in the Red Hat Cluster Suite Overview.

For more information on installing and configuring Cluster Manager for Lustre failover, and testing MDS failover, see Cluster Manager.


22.3 SNMP Monitoring

Lustre has a native SNMP module, which enables you to use various standard SNMP monitoring packages (anything using RRDTool as a backend) to track performance. For more information on installing, building and using the SNMP module, see Lustre SNMP Module.


22.4 CollectL

CollectL is another tool that can be used to monitor Lustre. You can run CollectL on a Lustre system that has any combination of MDSs, OSTs and clients. The collected data can be written to a file for continuous logging and played back at a later time. It can also be converted to a format suitable for plotting.

For more information about CollectL, see:

http://collectl.sourceforge.net

Lustre-specific documentation is also available. See:

http://collectl.sourceforge.net/Tutorial-Lustre.html


22.5 Other Monitoring Options

Another option is to script a simple monitoring solution that looks at various reports from ifconfig, as well as the procfs files generated by Lustre.
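As a minimal sketch of such a script, the Python example below parses the text of a Lustre counter file and reports how many operations were recorded between two readings. It assumes each counter line has the form "name count samples [unit]", optionally followed by min, max and sum values; this layout may differ between Lustre versions. The function names (parse_stats, sample_delta, watch) are hypothetical and not part of Lustre.

```python
import time
from pathlib import Path


def parse_stats(text):
    """Parse stats text into {counter_name: {"samples", "unit", "sum"}}.

    Assumed line layout (not guaranteed for every Lustre version):
        <name> <count> samples [<unit>] [<min> <max> <sum>]
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and lines such as snapshot_time that do not
        # follow the "<name> <count> samples [unit]" layout.
        if len(fields) < 4 or fields[2] != "samples":
            continue
        entry = {
            "samples": int(fields[1]),
            "unit": fields[3].strip("[]"),
            "sum": None,
        }
        # min, max and sum are present only for some counters.
        if len(fields) >= 7:
            entry["sum"] = int(fields[6])
        counters[fields[0]] = entry
    return counters


def sample_delta(before, after):
    """Return the change in sample count for counters in both snapshots."""
    return {
        name: after[name]["samples"] - before[name]["samples"]
        for name in after
        if name in before
    }


def watch(path, interval=5.0, rounds=3):
    """Print per-counter operation rates for a stats file, a few times."""
    previous = parse_stats(Path(path).read_text())
    for _ in range(rounds):
        time.sleep(interval)
        current = parse_stats(Path(path).read_text())
        for name, delta in sorted(sample_delta(previous, current).items()):
            print(f"{name:24s} {delta / interval:10.1f} ops/s")
        previous = current
```

To use it, call watch() with the path of a stats file on the node being monitored; the exact location under procfs depends on the Lustre version and on the target, so check the node before hard-coding a path. Network interface counters could be polled in the same way and printed alongside the Lustre rates.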



[1] LMT was developed by Lawrence Livermore National Lab (LLNL) and continues to be maintained by LLNL.
[2] Lustre client monitoring is not supported.
[3] The Lustre Group has made several scripts available for MDS failover testing.