== Introduction ==

There are a variety of useful statistics and counters available on Lustre servers and clients. This is an attempt to detail some of these statistics and methods for collecting and working with them.

This does not include Lustre log analysis.

The presumed audience for this is system administrators attempting to better understand and monitor their Lustre file systems.

=== Adding to This Guide ===

If you have improvements, corrections, or more information to share on this topic, please contribute to this page. Ideally this would become a community resource.

== Lustre Versions ==

This information was originally based on working with Lustre 2.4 and 2.5. The same metrics are available in 2.10.

== Reading /proc vs lctl ==

'cat /proc/fs/lustre...' vs 'lctl get_param'

With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats. This is to ensure portability. I will use this method in all examples; a bonus is that the syntax can often be a little shorter.

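For example, these two commands read the same counters. (The /proc path shown is typical, but the exact location of a given parameter can move between Lustre versions and between /proc and /sys, which is exactly why 'lctl get_param' is preferred.)
<pre>
# lctl get_param obdfilter.*.stats
# cat /proc/fs/lustre/obdfilter/*/stats
</pre>
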
== Data Formats ==
The format of the various statistics files varies (and I'm not sure if there is any reason for this). The format names here are entirely *my invention*; this isn't a standard for Lustre or anything.

It is useful to know the various formats of these files so you can parse the data and collect it for use in other tools.

=== Stats ===

What I consider "standard" stats files present each OST or MDT as a multi-line record, followed by the data.

Example:
<pre>
obdfilter.scratch-OST0001.stats=
snapshot_time 1409777887.590578 secs.usecs
read_bytes 27846475 samples [bytes] 4096 1048576 14421705314304
write_bytes 16230483 samples [bytes] 1 1048576 14761109479164
get_info 3735777 samples [reqs]
</pre>

The basic format of each line of the '''stats''' files is:

{name of statistic} {count of events} samples [{units}]

Some statistics also contain min/max/''average'' values:

{name of statistic} {count of events} samples [{units}] {minimum value} {maximum value} {sum of values}

The average (mean) value can be computed as {sum of values}/{count of events}, since it isn't possible to do floating-point math in the kernel.

Some statistics also contain ''standard deviation'' data:

{name of statistic} {count of events} samples [{units}] {minimum value} {maximum value} {sum of values} {sum of values squared}

The standard deviation can be computed as sqrt({sum of values squared}/{count of events} - {mean value}²).

snapshot_time = when the stats were written

For read_bytes and write_bytes:
* First number = the number of times (samples) the OST has handled a read or write
* Second number = the minimum read/write size
* Third number = the maximum read/write size
* Fourth number = the sum of all the read/write requests in bytes, i.e. the quantity of data read/written
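
As a minimal parsing sketch (my own, not part of Lustre), assuming the text output of 'lctl get_param obdfilter.*.stats' has been captured, the following Python applies the field layout above and derives the mean and standard deviation when the extra fields are present:
<pre>
import math
import re

def parse_stats_line(line):
    """Parse one line of a Lustre 'stats' file into a dict.

    Handles the three layouts described above: count only,
    count + min/max/sum, and count + min/max/sum + sum of squares.
    Returns None for lines that don't match (e.g. snapshot_time).
    """
    m = re.match(r'^(\S+)\s+(\d+) samples \[(\S+)\]\s*(.*)$', line)
    if not m:
        return None
    name, count, unit = m.group(1), int(m.group(2)), m.group(3)
    rest = m.group(4).split()
    stat = {'name': name, 'count': count, 'unit': unit}
    if len(rest) >= 3:
        stat['min'], stat['max'], stat['sum'] = (int(x) for x in rest[:3])
        if count:
            stat['mean'] = stat['sum'] / count
    if len(rest) >= 4 and count:
        # population standard deviation from the sum of squares
        stat['stddev'] = math.sqrt(max(int(rest[3]) / count - stat['mean'] ** 2, 0))
    return stat

print(parse_stats_line("write_bytes 16230483 samples [bytes] 1 1048576 14761109479164"))
</pre>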

=== Jobstats ===

Jobstats are slightly more complex multi-line records. They are formatted in YAML, which looks like JSON except for the dash (-) block that starts each job. Each OST or MDT has an entry for each job ID (or procname_uid, depending on the jobid_var setting), followed by the data.

Example:
<pre>
obdfilter.scratch-OST0000.job_stats=job_stats:
- job_id: 56744
  snapshot_time: 1409778251
  read: { samples: 18722, unit: bytes, min: 4096, max: 1048576, sum: 17105657856 }
  write: { samples: 478, unit: bytes, min: 1238, max: 1048576, sum: 412545938 }
  setattr: { samples: 0, unit: reqs }
  punch: { samples: 95, unit: reqs }
- job_id: . . . ETC
</pre>

Notice this is very similar to 'stats' above.
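
Because it is YAML, the output can be parsed with a standard YAML library rather than by hand. A rough sketch (assuming Python with PyYAML installed and the text output of 'lctl get_param obdfilter.*.job_stats' captured as a string):
<pre>
import yaml  # PyYAML

def parse_job_stats(text):
    """Map each obdfilter target name to its list of per-job entries."""
    targets, param, lines = {}, None, []
    for line in text.splitlines():
        if line.startswith('obdfilter.') and '.job_stats=' in line:
            if param is not None:
                targets[param] = yaml.safe_load('\n'.join(lines)).get('job_stats') or []
            param, _, first = line.partition('=')
            lines = [first]
        elif param is not None:
            lines.append(line)
    if param is not None:
        targets[param] = yaml.safe_load('\n'.join(lines)).get('job_stats') or []
    return targets

# Example use: total bytes written per job across all OSTs
# for target, jobs in parse_job_stats(text).items():
#     for job in jobs:
#         print(target, job['job_id'], job['write']['sum'])
</pre>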

=== Single ===

These really boil down to just a single number in a file. But if you use "lctl get_param" you get output that is nice for parsing. For example:
<pre>
[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail
osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532
</pre>
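
Since every line is just 'parameter=value', a one-line split is enough to turn this into something usable. A minimal sketch in Python, assuming the command output has been captured as text:
<pre>
def parse_single(text):
    """Parse 'parameter=value' lines from 'lctl get_param' into a dict of ints."""
    values = {}
    for line in text.splitlines():
        if '=' in line:
            param, _, value = line.partition('=')
            values[param.strip()] = int(value)
    return values

example = """osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532"""
print(parse_single(example))
</pre>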

=== Histogram ===

Some stats are histograms; these types aren't covered here. Typically they seem useful on their own without further parsing.

* brw_stats
* extent_stats

== Interesting Statistics Files ==

This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. For example, you will notice these are mostly server stats. There is a wealth of client stats too, not detailed here. Additions or corrections are welcome.

* Host Type = MDS, OSS, client
* Target = the parameter name passed to "lctl get_param"
* Format = data format discussed above

{| class="wikitable"<br />
|-<br />
!Host Type !! Target !! Format !! Discussion<br />
|-<br />
| MDS || mdt.*MDT*.num_exports || single || number of exports per MDT - these are clients, including other lustre servers<br />
|-<br />
| MDS || mdt.*.job_stats || jobstats || Metadata jobstats. Note that with lustre DNE you may have more than one MDT, so even if you don't it may be wise to design any tools with that assumption.<br />
|-<br />
| OSS || obdfilter.*.job_stats || jobstats || the per OST jobstats. <br />
|-<br />
| MDS || mdt.*.md_stats || stats || Overall metadata stats per MDT<br />
|-<br />
| MDS || mdt.*MDT*.exports.*@*.stats || stats || Per-export metadata stats. The exports subdirectory lists client connections by NID. The exports are named by interfaces, which can be unweildy. See "lltop" for an example of a script that used this data well. The sum of all the export stats should provide the same data as md_stats, but it is still very convenient to have md_stats, "ltop" uses them for example.<br />
|-<br />
| OSS || obdfilter.*.stats || stats || Operations per OST. Read and write data is particularly interesting<br />
|-<br />
| OSS || obdfilter.*OST*.exports.*@*.stats || stats || per-export OSS statistics<br />
|-<br />
| MDS || osd-*.*MDT*.filesfree or filestotal || single || available or total inodes<br />
|-<br />
| MDS || osd-*.*MDT*.kbytesfree or kbytestotal || single || available or total disk space<br />
|-<br />
| OSS || obdfilter.*OST*.kbytesfree or kbytestotal, filesfree, filestotal || single || inodes and disk space as in MDS version<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.stats || stats? but unsure of all fields meaning || lustre distributed lock manager (ldlm) stats. I do not fully understand these stats or the format. It also appears that these same stats are duplicated a single stats. My understanding of these stats comes from http://wiki.old.lustre.org/doxygen/HEAD/api/html/ldlm__pool_8c_source.html<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.lock_count || single || number of locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.granted || single || lustre distributed lock manager (ldlm) granted locks<br />
|- | OSS || ldlm.namespaces.filter-*.pool.grant_plan || single || ldlm lock planned number of granted lock<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_rate || single || ldlm lock grant rate aka 'GR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.cancel_rate || single || ldlm lock cancel rate aka 'CR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_speed || single || ldlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR'. Or you can just get 'CR' from the stats file I assume.<br />
|}<br />
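
As one example of putting the table to use, the kbytesfree/kbytestotal single-value parameters can be combined into a per-OST fullness figure. A minimal sketch (my own, not an official tool), run on an OSS, assuming Python 3.7+ and that 'lctl' is in the PATH:
<pre>
import subprocess

def get_param(pattern):
    """Run 'lctl get_param' and return {parameter: integer value}."""
    out = subprocess.run(['lctl', 'get_param', pattern],
                         capture_output=True, text=True, check=True).stdout
    return {p.strip(): int(v) for p, _, v in
            (line.partition('=') for line in out.splitlines() if '=' in line)}

free = get_param('obdfilter.*OST*.kbytesfree')
total = get_param('obdfilter.*OST*.kbytestotal')
for param, kb_total in sorted(total.items()):
    kb_free = free[param.replace('kbytestotal', 'kbytesfree')]
    target = param.split('.')[1]   # e.g. scratch-OST0000
    print('%s: %.1f%% used' % (target, 100.0 * (kb_total - kb_free) / kb_total))
</pre>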

== Working With the Data ==

Packages, tools, and techniques for working with Lustre statistics.

=== Open Source Monitoring Packages ===

* LMT - provides 'top' style monitoring of server nodes, and historical data via MySQL. https://github.com/chaos/lmt
* lltop and xltop - monitoring with batch scheduler integration. Newer Lustre versions with jobstats likely provide similar data very conveniently, but these are still very good examples of working with monitoring data. https://github.com/jhammond/lltop https://github.com/jhammond/xltop
* Integrated Manager for Lustre - with version 4.0, IML is FOSS software. This can be installed in a monitoring-only mode. https://github.com/whamcloud/integrated-manager-for-lustre

=== Build it Yourself ===

Here are basic steps and techniques for working with the Lustre statistics.

# '''Gather''' the data on hosts you are monitoring. Deal with the syntax and extract what you want.
# '''Collect''' the data centrally - either pull or push it to your server, or collection of monitoring servers.
# '''Process''' the data - this may be optional or minimal.
# '''Alert''' on the data - optional but often useful.
# '''Present''' the data - allow for visualization, analysis, etc.

Some recent tools for working with metrics and time-series data have made some of the more difficult parts of this task relatively easy, especially graphical presentation.

Here are details of some solutions tested or in use:

==== Ganglia ====

# Via Collectl
## '''Old collectl method'''
##* collectl - does the '''gather''' by writing to a text file on the host being monitored
##* ganglia does the '''collect''' via gmond and the python script 'collectl.py', and the '''present''' via ganglia web pages - there is no alerting
##* See https://wiki.rocksclusters.org/wiki/index.php/Roy_Dragseth#Integrating_collectl_and_ganglia
## Newer '''collectl plugin''' from https://github.com/pcpiela/collectl-lustre - note there have recently been some changes; after collectl-3.7.3, Lustre support in collectl is moving to plugins: http://sourceforge.net/p/collectl/mailman/message/31992463
# Via Ganglia python plugin
#* A '''ganglia plugin''' [https://github.com/ganglia/gmond_python_modules gmond python module] for monitoring Lustre clients is available via the [https://github.com/ganglia ganglia github project]

==== Perl and Graphite ====

Graphite is a very convenient tool for storing, working with, and rendering graphs of time-series data. At SSEC we did a quick prototype for collecting and sending MDS and OSS data using Perl. The choice of Perl is not particularly important; Python or the tool of your choice is fine.

Software Used:
* Graphite and Carbon - http://graphite.readthedocs.org/en/latest/
* http://www.ssec.wisc.edu/~scottn/files/Lustrestats.pm - Perl module to parse the different types of Lustre stats, used by the lustrestats scripts
* lustrestats scripts - these are simply run every minute via cron on the servers you monitor. For the SSEC prototype we simply sent text data via a TCP socket (see the sketch after this list). The check_mk scripts in the next section have replaced these original test scripts.
* Grafana - http://grafana.org - this is a dashboard and graph editor for Graphite. It is not required, as Graphite can be used directly, but it is very convenient. It not only makes creating dashboards easy, but also encourages rapid interactive analysis of the data. Note that Elasticsearch can be used to store dashboards for Grafana, but is not required.
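
The original prototype was a short Perl script and is not reproduced here, but the idea is small enough to sketch. Graphite's plaintext protocol is simply lines of 'metric.path value timestamp' sent over TCP to carbon-cache (port 2003 by default). The sketch below (Python, with a hypothetical Graphite host name) gathers one counter per OST and ships it with the snapshot_time taken from the stats file itself:
<pre>
import socket
import subprocess
import time

CARBON_HOST = 'graphite.example.com'   # assumption: your carbon-cache host
CARBON_PORT = 2003                     # carbon's default plaintext listener

def send_to_carbon(metrics):
    """Send (path, value, timestamp) tuples using Graphite's plaintext protocol."""
    lines = ['%s %s %d' % (path, value, ts) for path, value, ts in metrics]
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=10) as sock:
        sock.sendall(('\n'.join(lines) + '\n').encode())

out = subprocess.run(['lctl', 'get_param', 'obdfilter.*OST*.stats'],
                     capture_output=True, text=True, check=True).stdout
metrics, target, ts = [], None, int(time.time())
for line in out.splitlines():
    fields = line.split()
    if line.startswith('obdfilter.') and line.endswith('.stats='):
        target = line.split('.')[1]        # e.g. scratch-OST0000
    elif fields[:1] == ['snapshot_time']:
        ts = int(float(fields[1]))         # use the timestamp Lustre recorded
    elif target and fields[:1] == ['write_bytes']:
        # fields: name, count, 'samples', '[bytes]', min, max, sum
        metrics.append(('lustre.%s.write_bytes_sum' % target, fields[6], ts))

send_to_carbon(metrics)
</pre>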

==== check_mk and Graphite ====

Another option is, instead of sending directly with Perl, to use a check_mk local agent check.

The local agent and pnp4nagios mean a reasonable infrastructure is already in place for alerting and also for collecting performance data.
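
The actual scripts used at SSEC are linked at the end of this section. To illustrate the general shape, here is a small hedged sketch (not one of the SSEC scripts) of a local check that would be dropped into the agent's local/ directory; each output line follows the local-check format '<state> <service name> <perfdata> <status text>', and the perfdata field is what ends up graphed:
<pre>
#!/usr/bin/env python3
# Hypothetical check_mk local check reporting available space per OST.
import subprocess

out = subprocess.run(['lctl', 'get_param', 'obdfilter.*OST*.kbytesavail'],
                     capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    if '=' not in line:
        continue
    param, _, value = line.partition('=')
    target = param.split('.')[1]        # e.g. scratch-OST0000
    # state 0 = OK; the metric in the perfdata field is picked up by pnp4nagios/graphios
    print('0 Lustre_%s_kbytesavail kbytesavail=%s OK - %s kbytes available'
          % (target, value.strip(), value.strip()))
</pre>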

Collecting via Perl allowed us to send the timestamp from the Lustre stats (when one exists) directly to Carbon, Graphite's data collection daemon. When using the check_mk method this timestamp is lost, so timestamps are based on when the local agent check runs. This introduces some inaccuracy - a delay of up to your sample interval.

Collecting via both methods allows you to see this difference. This graph shows all the "export" stats summed for each method, with a derivative applied to create a rate of change. "CMK" is the check_mk data and "timestamped" is from the Perl script. Plotting the raw counter data of course shows very little, but with this derived data you can see the difference.

This data was sampled once per minute:

[[File:Timestamp_graphite_jitter.PNG|400px]]

For our uses at SSEC, this was acceptable. Sampling much more frequently will of course make the error smaller.

* Graphite - http://graphite.readthedocs.org/en/latest/
* http://www.ssec.wisc.edu/~scottn/files/Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts
* OMD - check_mk, nagios, pnp4nagios
* check_mk local scripts - these are called via check_mk, at whatever rate is desired. http://www.ssec.wisc.edu/~scottn/files/lustre_stats_mds.cmk http://www.ssec.wisc.edu/~scottn/files/lustre_stats_oss.cmk
* graphios https://github.com/shawn-sterling/graphios - a python script to send your nagios performance data to graphite
* Grafana - http://grafana.org - not required, but convenient for dashboards.

'''Grafana Lustre Dashboard Screenshots:'''

[[File:Meta-oveview.PNG|200px|Metadata for multiple file systems.]] [[File:Fs-dashboard.PNG|200px|Dashboard for a lustre file system.]]

==== Logstash, python, and Graphite ====

Brock Palen discusses this method: http://www.failureasaservice.com/2014/10/lustre-stats-with-graphite-and-logstash.html

==== Collectd plugin and Graphite ====

This talk mentions a custom collectd plugin to send stats to Graphite:
http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf

It is unclear whether the source for that plugin is available.

==== Prometheus Exporters ====

* https://github.com/HewlettPackard/lustre_exporter - this appears inactive (10/2022)
* https://github.com/GSI-HPC/lustre_exporter - a fork of the HewlettPackard exporter with more recent changes
* https://github.com/whamcloud/lustrefs-exporter - reinventing things with Rust instead of Go?

==== A Note about Jobstats ====

If using a Whisper or RRD-file based solution, jobstats may not be a great fit. The strength of RRD or Whisper files is that each metric collected has a fixed size on disk. If your metrics are per-job, as opposed to only per-export or per-server, your ''number of metrics'' grows without bound.

Solutions anyone?

==== Jobstats: Finding jobs doing I/O over a watermark ====

The Perl script show_high_jobstats.pl can be used to collect and filter the current jobstats output from Lustre servers. It is useful for finding jobs doing I/O over a high watermark. This tool was briefly mentioned during Roland Laifer's talk at LAD'15 (see references). You can download it here:
https://www.scc.kit.edu/scc/sw/lustre_tools/show_high_jobstats.tgz

==== Jobstats: A lightweight solution to provide I/O statistics to users ====

The second part of Roland Laifer's talk at LAD'15 (see references) described a lightweight solution to provide I/O statistics to users. When a batch job is submitted, a user can request statistics for dedicated Lustre file systems. After job completion the batch system writes files which include the job ID, file system name, user name, and email address. For each file system, a cron job on one server uses these files, collects jobstats from all servers, and sends an email with I/O statistics to the user. You can download the scripts (which require a few modifications) and a detailed description here:
https://www.scc.kit.edu/scc/sw/lustre_tools/jobstats2email.tgz

Note that array jobs are not well tested and might cause problems. For example, job IDs might get forged or a single job array could initiate thousands of emails. Therefore, it might be a good idea to send no emails for array jobs: the batch system could simply create no input files if job arrays are used.

== References and Links ==

* http://cdn.opensfs.org/wp-content/uploads/2015/04/Lustre-Metrics-New-Techniques-for-Monitoring_Nolin_Wagner.pdf
* Daniel Kobras, "Lustre - Finding the Lustre Filesystem Bottleneck", LAD2012. http://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf
* Florent Thery, "Centralized Lustre Monitoring on Bull Platforms", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/11_Florent_Thery_LAD2013-lustre-bull-monitoring.pdf
* Daniel Rodwell and Patrick Fitzhenry, "Fine-Grained File System Monitoring with Lustre Jobstat", LUG2014. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf
* Gabriele Paciucci and Andrew Uselton, "Monitoring the Lustre* file system to maintain optimal performance", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/15_Gabriele_Paciucci_LAD13_Monitoring_05.pdf
* Christopher Morrone, "LMT Lustre Monitoring Tools", LUG2011. http://cdn.opensfs.org/wp-content/uploads/2012/12/400-430_Chris_Morrone_LMT_v2.pdf
* Roland Laifer, "Lustre tools for ldiskfs investigation and lightweight I/O statistics", LAD2015. http://www.eofs.eu/fileadmin/lad2015/slides/13_Roland_Laifer_kit_20150922.pdf
* Lustre and associated scripts used by SSEC - http://www.ssec.wisc.edu/~scottn/files/

* https://github.com/jhammond/lltop
* https://github.com/chaos/lmt
* https://github.com/chaos/cerebro
* http://graphite.readthedocs.org/en/latest/
* https://mathias-kettner.de/check_mk
* https://github.com/shawn-sterling/graphios

[[Category: Monitoring|!]]
[[Category: Statistics]]
[[Category: Metrics]]
<hr />
<div>== Introduction ==<br />
<br />
There are a variety of useful statistics and counters available on Lustre servers and clients. This is an attempt to detail some of these statistics and methods for collecting and working with them.<br />
<br />
This does not include Lustre log analysis.<br />
<br />
The presumed audience for this is system administrators attempting to better understand and monitor their Lustre file systems.<br />
<br />
=== Adding to This Guide ===<br />
<br />
If you have improvements, corrections, or more information to share on this topic please contribute to this page. Ideally this would become a community resource.<br />
<br />
== Lustre Versions ==<br />
<br />
This information was originally based on working with Lustre 2.4 and 2.5. The same metrics are available in 2.10.<br />
<br />
== Reading /proc vs lctl ==<br />
<br />
'cat /proc/fs/lustre...' vs 'lctl get_param'<br />
<br />
With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats. This is to insure portability. I will use this method in all examples, a bonus is it can be often be a little shorter syntax. <br />
<br />
== Data Formats ==<br />
Format of the various statistics type files varies (and I'm not sure if there is any reason for this). The format names here are entirely *my invention*, this isn't a standard for Lustre or anything.<br />
<br />
It is useful to know the various formats of these files so you can parse the data and collect for use in other tools. <br />
<br />
=== Stats ===<br />
<br />
What I consider a "standard" stats files include for example each OST or MDT as a multi-line record, and then just the data. <br />
<br />
Example:<br />
<pre><br />
obdfilter.scratch-OST0001.stats=<br />
snapshot_time 1409777887.590578 secs.usecs<br />
read_bytes 27846475 samples [bytes] 4096 1048576 14421705314304<br />
write_bytes 16230483 samples [bytes] 1 1048576 14761109479164<br />
get_info 3735777 samples [reqs]<br />
</pre><br />
<br />
The basic format of each line of the '''stats''' files is:<br />
<br />
{name of statistic} {count of events} samples [{units}]<br />
<br />
Some statistics also contain min/max/''average'' values:<br />
<br />
{name of statistic} {count of events} samples [{units}] {minimum value} {maximum value} {sum of values}<br />
<br />
The average (mean value) value can be computed from {sum of values}/{count of events} since it isn't possible to do floating-point math in the kernel.<br />
<br />
Some statistics also contain ''standard deviation'' data:<br />
<br />
{name of statistic} {count of events} samples [{units}] {minimum value} {maximum value} {sum of values} {sum of value squared}<br />
<br />
The standard deviation can be computed from sqrt({sum of values squared} - {mean value}²).<br />
<br />
snapshot_time = when the stats were written<br />
<p><br />
For read_bytes and write_bytes:<br />
* First number = number of times (samples) the OST has handled a read or write. <br />
* Second number = the minimum read/write size<br />
* Third number = maximum read/write size<br />
* Fourth = sum of all the read/write requests in bytes, the quantity of data read/written.<br />
<br />
=== Jobstats ===<br />
<br />
Jobstats are slightly more complex multi-line records. They are formatted in YAML, which looks like JSON, except for the (-) blocks for each job. Each OST or MDT also has an entry for each jobid (or procname_uid perhaps), and then the data. <br />
<br />
Example: <br />
<pre><br />
obdfilter.scratch-OST0000.job_stats=job_stats:<br />
- job_id: 56744<br />
snapshot_time: 1409778251<br />
read: { samples: 18722, unit: bytes, min: 4096, max: 1048576, sum: 17105657856 }<br />
write: { samples: 478, unit: bytes, min: 1238, max: 1048576, sum: 412545938 }<br />
setattr: { samples: 0, unit: reqs } punch: { samples: 95, unit: reqs }<br />
- job_id: . . . ETC<br />
</pre><br />
<br />
Notice this is very similar to 'stats' above.<br />
<br />
=== Single ===<br />
<br />
These really boil down to just a single number in a file. But if you use "lctl get_param" you get an output that is nice for parsing. For example: <br />
<pre>[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail<br />
<br />
<br />
osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384<br />
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540<br />
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532<br />
</pre><br />
<br />
=== Histogram ===<br />
<br />
Some stats are histograms, these types aren't covered here. Typically they're useful on their own without further parsing(?)<br />
<br />
<br />
* brw_stats<br />
* extent_stats<br />
<br />
<br />
<br />
== Interesting Statistics Files ==<br />
<br />
This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. For example, you will noticed these are mostly server stats. There are a wealth of client stats too not detailed here. Additions or corrections are welcome.<br />
<br />
* Host Type = MDS, OSS, client<br />
* Target = "lctl get_param target"<br />
* Format = data format discussed above<br />
<br />
{| class="wikitable"<br />
|-<br />
!Host Type !! Target !! Format !! Discussion<br />
|-<br />
| MDS || mdt.*MDT*.num_exports || single || number of exports per MDT - these are clients, including other lustre servers<br />
|-<br />
| MDS || mdt.*.job_stats || jobstats || Metadata jobstats. Note that with lustre DNE you may have more than one MDT, so even if you don't it may be wise to design any tools with that assumption.<br />
|-<br />
| OSS || obdfilter.*.job_stats || jobstats || the per OST jobstats. <br />
|-<br />
| MDS || mdt.*.md_stats || stats || Overall metadata stats per MDT<br />
|-<br />
| MDS || mdt.*MDT*.exports.*@*.stats || stats || Per-export metadata stats. The exports subdirectory lists client connections by NID. The exports are named by interfaces, which can be unweildy. See "lltop" for an example of a script that used this data well. The sum of all the export stats should provide the same data as md_stats, but it is still very convenient to have md_stats, "ltop" uses them for example.<br />
|-<br />
| OSS || obdfilter.*.stats || stats || Operations per OST. Read and write data is particularly interesting<br />
|-<br />
| OSS || obdfilter.*OST*.exports.*@*.stats || stats || per-export OSS statistics<br />
|-<br />
| MDS || osd-*.*MDT*.filesfree or filestotal || single || available or total inodes<br />
|-<br />
| MDS || osd-*.*MDT*.kbytesfree or kbytestotal || single || available or total disk space<br />
|-<br />
| OSS || obdfilter.*OST*.kbytesfree or kbytestotal, filesfree, filestotal || single || inodes and disk space as in MDS version<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.stats || stats? but unsure of all fields meaning || lustre distributed lock manager (ldlm) stats. I do not fully understand these stats or the format. It also appears that these same stats are duplicated a single stats. My understanding of these stats comes from http://wiki.old.lustre.org/doxygen/HEAD/api/html/ldlm__pool_8c_source.html<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.lock_count || single || number of locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.granted || single || lustre distributed lock manager (ldlm) granted locks<br />
|- | OSS || ldlm.namespaces.filter-*.pool.grant_plan || single || ldlm lock planned number of granted lock<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_rate || single || ldlm lock grant rate aka 'GR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.cancel_rate || single || ldlm lock cancel rate aka 'CR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_speed || single || ldlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR'. Or you can just get 'CR' from the stats file I assume.<br />
|}<br />
<br />
== Working With the Data ==<br />
<br />
Packages, tools, and techniques for working with Lustre statistics.<br />
<br />
=== Open Source Monitoring Packages ===<br />
<br />
*LMT - provides 'top' style monitoring of server nodes, and historical data via mysql. https://github.com/chaos/lmt<br />
*lltop and xltop - monitoring with batch scheduler integration. Newer Lustre versions with jobstats likely provide similar data very conveniently, but these are still very good for examples of working with monitoring data. https://github.com/jhammond/lltop https://github.com/jhammond/xltop<br />
*Integrated Manager for Lustre - With version 4.0, IML is FOSS software. This can be installed in a monitoring only mode. https://github.com/whamcloud/integrated-manager-for-lustre<br />
<br />
=== Build it Yourself ===<br />
<br />
Here are basic steps and techniques for working with the Lustre statistics. <br />
<br />
# '''Gather''' the data on hosts you are monitoring. Deal with the syntax, extract what you want<br />
# '''Collect''' the data centrally - either pull or push it to your server, or collection of monitoring servers.<br />
# '''Process''' the data - this may be optional or minimal.<br />
# '''Alert''' on the data - optional but often useful.<br />
# '''Present''' the data - allow for visualization, analysis, etc.<br />
<br />
Some recent tools for working with metrics and time series data have made some of the more difficult parts of this task relatively easy, especially graphical presentation.<br />
<br />
Here are details of some solutions tested or in use:<br />
<br />
==== Ganglia ====<br />
<br />
# Via Collectl<br />
## '''Old collectl method'''<br />
##* collectl - does the '''gather''' by writing to a text file on the host being monitored<br />
##* ganglia does the '''collect''' via gmond and python script 'collectl.py' and '''present''' via ganglia web pages - there is no alerting.<br />
##* See https://wiki.rocksclusters.org/wiki/index.php/Roy_Dragseth#Integrating_collectl_and_ganglia<br />
## Newer '''collectl plugin''' from https://github.com/pcpiela/collectl-lustre - Note there have recently been some changes, after collectl-3.7.3 Lustre support in collectl is moving to plugins: http://sourceforge.net/p/collectl/mailman/message/31992463 <br />
# Via Ganglia python plugin<br />
#* A '''ganglia plugin''' [https://github.com/ganglia/gmond_python_modules gmond python module] for monitoring lustre client is available via [https://github.com/ganglia ganglia github project]<br />
<br />
==== Perl and Graphite ====<br />
<br />
<br />
Graphite is a very convenient tool for storing, working with, and rendering graphs of time-series data. At SSEC we did a quick prototype for collecting and sending MDS and OSS data using perl. The choice of perl is not particularly important, python or the tool of your choice is fine.<br />
<br />
Software Used:<br />
* Graphite and Carbon - http://graphite.readthedocs.org/en/latest/<br />
* http://www.ssec.wisc.edu/~scottn/files/Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* lustrestats scripts - these are simply run every minute via cron on the servers you monitor. For the SSEC prototype we simply sent text data via a TCP socket. The check_mk scripts in the next section have replaced these original test scripts.<br />
* Grafana - http://grafana.org - this is a dashboard and graph editor for graphite. It is not required, as graphite can be used directly, but is very convenient. I allows for not just ease of creating dashboards, but also encoruages rapid interactive analysis of the data. Note that elasticsearch can be used to store dashboards for grafana, but is not required.<br />
<br />
==== check_mk and Graphite ====<br />
<br />
Another option is instead of directly sending with perl, use a check_mk local agent check.<br />
<br />
The local agent and pnp4nagios mean a reasonable infrastructure is already in place for alerting and also collecting performance data.<br />
<br />
While collecting via perl allowed us to send the timestamp from the Lustre stats (when they exist) directly to Carbon, Graphite's data collection tool. When using the check_mk method this timestamp is lost, so timestamps are then based on when the local agent check runs. This will introduce some inaccuracy - a delay of up to your sample rate. <br />
<br />
Collecting via both methods allows you to see this difference. This graph shows all the "export" stats summed for each method, with derivative applied to create a rate of change. "CMK" is the check_mk data and "timestamped" was from the perl script. Plotting the raw counter data of course shows very little, but with this derived data you can see the difference.<br />
<br />
This data was sampled once per minute: <br />
<br />
[[File:Timestamp_graphite_jitter.PNG|400px]]<br />
<br />
For our uses at SSEC, this was acceptable. Sampling much more frequently will of course make the error smaller.<br />
<br />
<br />
* Graphite - http://graphite.readthedocs.org/en/latest/<br />
* http://www.ssec.wisc.edu/~scottn/files/Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* OMD - check_mk, nagios, pnp4nagios<br />
* check_mk local scripts - these are called via check_mk, at whatever rate is desired. http://www.ssec.wisc.edu/~scottn/files/lustre_stats_mds.cmk http://www.ssec.wisc.edu/~scottn/files/lustre_stats_oss.cmk<br />
* graphios https://github.com/shawn-sterling/graphios - a python script to send your nagios performance data to graphite<br />
* Grafana - http://grafana.org - not required, but convenient for dashboards.<br />
<br />
'''Grafana Lustre Dashboard Screenshots:'''<br />
<br />
[[File:Meta-oveview.PNG|200px|Metadata for multiple file systems.]] [[File:Fs-dashboard.PNG|200px|Dashboard for a lustre file system.]]<br />
<br />
==== Logstash, python, and Graphite ====<br />
<br />
Brock Palen discusses this method: http://www.failureasaservice.com/2014/10/lustre-stats-with-graphite-and-logstash.html<br />
<br />
==== Collectd plugin and Graphite ====<br />
<br />
This talk mentions a custom collectd plugin to send stats to graphite:<br />
http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
<br />
Unsure if the source for that plugin is available.<br />
<br />
==== A Note about Jobstats ====<br />
<br />
If using a whisper or RRD-file based solution, jobstats may not be a great fit. The strength of RRD or Whisper files are you have a set size for each metric collected. If your metrics are now per-job as opposed to only per-export or per-server, this means your ''number of metrics'' is now growing without bound.<br />
<br />
Solutions anyone?<br />
<br />
==== Jobstats: Finding jobs doing I/O over watermark ====<br />
<br />
The Perl script show_high_jobstats.pl can be used to collect and filter current output created by jobstats from Lustre servers. It is useful to check for jobs doing I/O over a high watermark. This tool was shortly mentioned during Roland Laifer's talk at LAD'15, see references. You can download it here: <br />
https://www.scc.kit.edu/scc/sw/lustre_tools/show_high_jobstats.tgz<br />
<br />
==== Jobstats: A lightweight solution to provide I/O statistics to users ====<br />
<br />
The second part of Roland Laifer's talk at LAD'15 (see references) described a lightweight solution to provide I/O statistics to users. When a batch job is submitted a user can request statistics for dedicated Lustre file systems. After job completion the batch system writes files which include job ID, file system name, user name and email address. For each file system, a cron job on one server uses these files, collects jobstats from all servers and sends an email with I/O statistics to the user. You can download scripts (which require few modifications) and a detailed description here: <br />
https://www.scc.kit.edu/scc/sw/lustre_tools/jobstats2email.tgz<br />
<br />
Note that array jobs are not well tested and might cause problems.<br />
For example, job IDs might get forged or a single job array could initiate thousands of emails. Therefore, it might be a good idea<br />
to send no emails for array jobs: The batch system could just create no input files if job arrays are used.<br />
<br />
== References and Links ==<br />
<br />
<br />
* http://cdn.opensfs.org/wp-content/uploads/2015/04/Lustre-Metrics-New-Techniques-for-Monitoring_Nolin_Wagner.pdf<br />
* Daniel Kobras, "Lustre - Finding the Lustre Filesystem Bottleneck", LAD2012. http://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf<br />
* Florent Thery, "Centralized Lustre Monitoring on Bull Platforms", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/11_Florent_Thery_LAD2013-lustre-bull-monitoring.pdf<br />
* Daniel Rodwell and Patrick Fitzhenry, "Fine-Grained File System Monitoring with Lustre Jobstat", LUG2014. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
* Gabriele Paciucci and Andrew Uselton, "Monitoring the Lustre* file system to maintain optimal performance", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/15_Gabriele_Paciucci_LAD13_Monitoring_05.pdf<br />
* Christopher Morrone, "LMT Lustre Monitoring Tools", LUG2011. http://cdn.opensfs.org/wp-content/uploads/2012/12/400-430_Chris_Morrone_LMT_v2.pdf<br />
* Roland Laifer, "Lustre tools for ldiskfs investigation and lightweight I/O statistics", LAD2015. http://www.eofs.eu/fileadmin/lad2015/slides/13_Roland_Laifer_kit_20150922.pdf<br />
* Lustre and associated scripts used by SSEC - http://www.ssec.wisc.edu/~scottn/files/<br />
<br />
* https://github.com/jhammond/lltop<br />
* https://github.com/chaos/lmt<br />
* https://github.com/chaos/cerebro<br />
* http://graphite.readthedocs.org/en/latest/<br />
* https://mathias-kettner.de/check_mk<br />
* https://github.com/shawn-sterling/graphios<br />
<br />
[[Category: Monitoring|!]]<br />
[[Category: Statistics]]<br />
[[Category: Metrics]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=Lustre_Monitoring_and_Statistics_Guide&diff=3439Lustre Monitoring and Statistics Guide2018-11-15T23:44:55Z<p>Sknolin: note that metrics work in 2.10</p>
<hr />
<div>== Introduction ==<br />
<br />
This guide is by Scott Nolin (scott.nolin@ssec.wisc.edu), of the University of Wisconsin Space Science and Engineering Center.<br />
<br />
There are a variety of useful statistics and counters available on Lustre servers and clients. This is an attempt to detail some of these statistics and methods for collecting and working with them.<br />
<br />
This does not include Lustre log analysis.<br />
<br />
The presumed audience for this is system administrators attempting to better understand and monitor their Lustre file systems.<br />
<br />
=== Adding to This Guide ===<br />
<br />
If you have improvements, corrections, or more information to share on this topic please contribute to this page. Ideally this would become a community resource.<br />
<br />
== Lustre Versions ==<br />
<br />
This information was originally based on working with Lustre 2.4 and 2.5. The same metrics are available in 2.10.<br />
<br />
== Reading /proc vs lctl ==<br />
<br />
'cat /proc/fs/lustre...' vs 'lctl get_param'<br />
<br />
With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats. This is to insure portability. I will use this method in all examples, a bonus is it can be often be a little shorter syntax. <br />
<br />
== Data Formats ==<br />
Format of the various statistics type files varies (and I'm not sure if there is any reason for this). The format names here are entirely *my invention*, this isn't a standard for Lustre or anything.<br />
<br />
It is useful to know the various formats of these files so you can parse the data and collect for use in other tools. <br />
<br />
=== Stats ===<br />
<br />
What I consider a "standard" stats files include for example each OST or MDT as a multi-line record, and then just the data. <br />
<br />
Example:<br />
<pre><br />
obdfilter.scratch-OST0001.stats=<br />
snapshot_time 1409777887.590578 secs.usecs<br />
read_bytes 27846475 samples [bytes] 4096 1048576 14421705314304<br />
write_bytes 16230483 samples [bytes] 1 1048576 14761109479164<br />
get_info 3735777 samples [reqs]<br />
</pre><br />
<br />
The basic format of each line of the '''stats''' files is:<br />
<br />
{name of statistic} {count of events} samples [{units}]<br />
<br />
Some statistics also contain min/max/''average'' values:<br />
<br />
{name of statistic} {count of events} samples [{units}] {minimum value} {maximum value} {sum of values}<br />
<br />
The average (mean value) value can be computed from {sum of values}/{count of events} since it isn't possible to do floating-point math in the kernel.<br />
<br />
Some statistics also contain ''standard deviation'' data:<br />
<br />
{name of statistic} {count of events} samples [{units}] {minimum value} {maximum value} {sum of values} {sum of value squared}<br />
<br />
The standard deviation can be computed from sqrt({sum of values squared} - {mean value}²).<br />
<br />
snapshot_time = when the stats were written<br />
<p><br />
For read_bytes and write_bytes:<br />
* First number = number of times (samples) the OST has handled a read or write. <br />
* Second number = the minimum read/write size<br />
* Third number = maximum read/write size<br />
* Fourth = sum of all the read/write requests in bytes, the quantity of data read/written.<br />
<br />
=== Jobstats ===<br />
<br />
Jobstats are slightly more complex multi-line records. They are formatted in YAML, which looks like JSON, except for the (-) blocks for each job. Each OST or MDT also has an entry for each jobid (or procname_uid perhaps), and then the data. <br />
<br />
Example: <br />
<pre><br />
obdfilter.scratch-OST0000.job_stats=job_stats:<br />
- job_id: 56744<br />
snapshot_time: 1409778251<br />
read: { samples: 18722, unit: bytes, min: 4096, max: 1048576, sum: 17105657856 }<br />
write: { samples: 478, unit: bytes, min: 1238, max: 1048576, sum: 412545938 }<br />
setattr: { samples: 0, unit: reqs } punch: { samples: 95, unit: reqs }<br />
- job_id: . . . ETC<br />
</pre><br />
<br />
Notice this is very similar to 'stats' above.<br />
<br />
=== Single ===<br />
<br />
These really boil down to just a single number in a file. But if you use "lctl get_param" you get an output that is nice for parsing. For example: <br />
<pre>[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail<br />
<br />
<br />
osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384<br />
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540<br />
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532<br />
</pre><br />
<br />
=== Histogram ===<br />
<br />
Some stats are histograms, these types aren't covered here. Typically they're useful on their own without further parsing(?)<br />
<br />
<br />
* brw_stats<br />
* extent_stats<br />
<br />
<br />
<br />
== Interesting Statistics Files ==<br />
<br />
This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. For example, you will noticed these are mostly server stats. There are a wealth of client stats too not detailed here. Additions or corrections are welcome.<br />
<br />
* Host Type = MDS, OSS, client<br />
* Target = "lctl get_param target"<br />
* Format = data format discussed above<br />
<br />
{| class="wikitable"<br />
|-<br />
!Host Type !! Target !! Format !! Discussion<br />
|-<br />
| MDS || mdt.*MDT*.num_exports || single || number of exports per MDT - these are clients, including other lustre servers<br />
|-<br />
| MDS || mdt.*.job_stats || jobstats || Metadata jobstats. Note that with lustre DNE you may have more than one MDT, so even if you don't it may be wise to design any tools with that assumption.<br />
|-<br />
| OSS || obdfilter.*.job_stats || jobstats || the per OST jobstats. <br />
|-<br />
| MDS || mdt.*.md_stats || stats || Overall metadata stats per MDT<br />
|-<br />
| MDS || mdt.*MDT*.exports.*@*.stats || stats || Per-export metadata stats. The exports subdirectory lists client connections by NID. The exports are named by interfaces, which can be unweildy. See "lltop" for an example of a script that used this data well. The sum of all the export stats should provide the same data as md_stats, but it is still very convenient to have md_stats, "ltop" uses them for example.<br />
|-<br />
| OSS || obdfilter.*.stats || stats || Operations per OST. Read and write data is particularly interesting<br />
|-<br />
| OSS || obdfilter.*OST*.exports.*@*.stats || stats || per-export OSS statistics<br />
|-<br />
| MDS || osd-*.*MDT*.filesfree or filestotal || single || available or total inodes<br />
|-<br />
| MDS || osd-*.*MDT*.kbytesfree or kbytestotal || single || available or total disk space<br />
|-<br />
| OSS || obdfilter.*OST*.kbytesfree or kbytestotal, filesfree, filestotal || single || inodes and disk space as in MDS version<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.stats || stats? but unsure of all fields meaning || lustre distributed lock manager (ldlm) stats. I do not fully understand these stats or the format. It also appears that these same stats are duplicated a single stats. My understanding of these stats comes from http://wiki.old.lustre.org/doxygen/HEAD/api/html/ldlm__pool_8c_source.html<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.lock_count || single || number of locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.granted || single || lustre distributed lock manager (ldlm) granted locks<br />
|- | OSS || ldlm.namespaces.filter-*.pool.grant_plan || single || ldlm lock planned number of granted lock<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_rate || single || ldlm lock grant rate aka 'GR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.cancel_rate || single || ldlm lock cancel rate aka 'CR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_speed || single || ldlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR'. Or you can just get 'CR' from the stats file I assume.<br />
|}<br />
<br />
== Working With the Data ==<br />
<br />
Packages, tools, and techniques for working with Lustre statistics.<br />
<br />
=== Open Source Monitoring Packages ===<br />
<br />
*LMT - provides 'top' style monitoring of server nodes, and historical data via mysql. https://github.com/chaos/lmt<br />
*lltop and xltop - monitoring with batch scheduler integration. Newer Lustre versions with jobstats likely provide similar data very conveniently, but these are still very good for examples of working with monitoring data. https://github.com/jhammond/lltop https://github.com/jhammond/xltop<br />
*Intel Manager for Lustre* - With version 4.0, IML is FOSS software. This can be installed in a monitoring only mode. https://github.com/intel-hpdd/Online-Help<br />
<br />
=== Build it Yourself ===<br />
<br />
Here are basic steps and techniques for working with the Lustre statistics. <br />
<br />
# '''Gather''' the data on hosts you are monitoring. Deal with the syntax, extract what you want<br />
# '''Collect''' the data centrally - either pull or push it to your server, or collection of monitoring servers.<br />
# '''Process''' the data - this may be optional or minimal.<br />
# '''Alert''' on the data - optional but often useful.<br />
# '''Present''' the data - allow for visualization, analysis, etc.<br />
<br />
Some recent tools for working with metrics and time series data have made some of the more difficult parts of this task relatively easy, especially graphical presentation.<br />
<br />
Here are details of some solutions tested or in use:<br />
<br />
==== Ganglia ====<br />
<br />
# Via Collectl<br />
## '''Old collectl method'''<br />
##* collectl - does the '''gather''' by writing to a text file on the host being monitored<br />
##* ganglia does the '''collect''' via gmond and python script 'collectl.py' and '''present''' via ganglia web pages - there is no alerting.<br />
##* See https://wiki.rocksclusters.org/wiki/index.php/Roy_Dragseth#Integrating_collectl_and_ganglia<br />
## Newer '''collectl plugin''' from https://github.com/pcpiela/collectl-lustre - Note there have recently been some changes, after collectl-3.7.3 Lustre support in collectl is moving to plugins: http://sourceforge.net/p/collectl/mailman/message/31992463 <br />
# Via Ganglia python plugin<br />
#* A '''ganglia plugin''' [https://github.com/ganglia/gmond_python_modules gmond python module] for monitoring lustre client is available via [https://github.com/ganglia ganglia github project]<br />
<br />
==== Perl and Graphite ====<br />
<br />
<br />
Graphite is a very convenient tool for storing, working with, and rendering graphs of time-series data. At SSEC we did a quick prototype for collecting and sending MDS and OSS data using perl. The choice of perl is not particularly important, python or the tool of your choice is fine.<br />
<br />
Software Used:<br />
* Graphite and Carbon - http://graphite.readthedocs.org/en/latest/<br />
* http://www.ssec.wisc.edu/~scottn/files/Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* lustrestats scripts - these are simply run every minute via cron on the servers you monitor. For the SSEC prototype we simply sent text data via a TCP socket. The check_mk scripts in the next section have replaced these original test scripts.<br />
* Grafana - http://grafana.org - this is a dashboard and graph editor for graphite. It is not required, as graphite can be used directly, but is very convenient. I allows for not just ease of creating dashboards, but also encoruages rapid interactive analysis of the data. Note that elasticsearch can be used to store dashboards for grafana, but is not required.<br />
<br />
==== check_mk and Graphite ====<br />
<br />
Another option is instead of directly sending with perl, use a check_mk local agent check.<br />
<br />
The local agent and pnp4nagios mean a reasonable infrastructure is already in place for alerting and also collecting performance data.<br />
<br />
While collecting via perl allowed us to send the timestamp from the Lustre stats (when they exist) directly to Carbon, Graphite's data collection tool. When using the check_mk method this timestamp is lost, so timestamps are then based on when the local agent check runs. This will introduce some inaccuracy - a delay of up to your sample rate. <br />
<br />
Collecting via both methods allows you to see this difference. This graph shows all the "export" stats summed for each method, with derivative applied to create a rate of change. "CMK" is the check_mk data and "timestamped" was from the perl script. Plotting the raw counter data of course shows very little, but with this derived data you can see the difference.<br />
<br />
This data was sampled once per minute: <br />
<br />
[[File:Timestamp_graphite_jitter.PNG|400px]]<br />
<br />
For our uses at SSEC, this was acceptable. Sampling much more frequently will of course make the error smaller.<br />
<br />
<br />
* Graphite - http://graphite.readthedocs.org/en/latest/<br />
* http://www.ssec.wisc.edu/~scottn/files/Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* OMD - check_mk, nagios, pnp4nagios<br />
* check_mk local scripts - these are called via check_mk, at whatever rate is desired. http://www.ssec.wisc.edu/~scottn/files/lustre_stats_mds.cmk http://www.ssec.wisc.edu/~scottn/files/lustre_stats_oss.cmk<br />
* graphios https://github.com/shawn-sterling/graphios - a python script to send your nagios performance data to graphite<br />
* Grafana - http://grafana.org - not required, but convenient for dashboards.<br />
<br />
'''Grafana Lustre Dashboard Screenshots:'''<br />
<br />
[[File:Meta-oveview.PNG|200px|Metadata for multiple file systems.]] [[File:Fs-dashboard.PNG|200px|Dashboard for a lustre file system.]]<br />
<br />
==== Logstash, python, and Graphite ====<br />
<br />
Brock Palen discusses this method: http://www.failureasaservice.com/2014/10/lustre-stats-with-graphite-and-logstash.html<br />
<br />
==== Collectd plugin and Graphite ====<br />
<br />
This talk mentions a custom collectd plugin to send stats to graphite:<br />
http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
<br />
Unsure if the source for that plugin is available.<br />
<br />
==== A Note about Jobstats ====<br />
<br />
If using a whisper or RRD-file based solution, jobstats may not be a great fit. The strength of RRD or Whisper files are you have a set size for each metric collected. If your metrics are now per-job as opposed to only per-export or per-server, this means your ''number of metrics'' is now growing without bound.<br />
<br />
Solutions anyone?<br />
<br />
==== Jobstats: Finding jobs doing I/O over watermark ====<br />
<br />
The Perl script show_high_jobstats.pl can be used to collect and filter current output created by jobstats from Lustre servers. It is useful to check for jobs doing I/O over a high watermark. This tool was shortly mentioned during Roland Laifer's talk at LAD'15, see references. You can download it here: <br />
https://www.scc.kit.edu/scc/sw/lustre_tools/show_high_jobstats.tgz<br />
<br />
==== Jobstats: A lightweight solution to provide I/O statistics to users ====<br />
<br />
The second part of Roland Laifer's talk at LAD'15 (see references) described a lightweight solution to provide I/O statistics to users. When a batch job is submitted a user can request statistics for dedicated Lustre file systems. After job completion the batch system writes files which include job ID, file system name, user name and email address. For each file system, a cron job on one server uses these files, collects jobstats from all servers and sends an email with I/O statistics to the user. You can download scripts (which require few modifications) and a detailed description here: <br />
https://www.scc.kit.edu/scc/sw/lustre_tools/jobstats2email.tgz<br />
<br />
Note that array jobs are not well tested and might cause problems.<br />
For example, job IDs might get forged or a single job array could initiate thousands of emails. Therefore, it might be a good idea<br />
to send no emails for array jobs: The batch system could just create no input files if job arrays are used.<br />
<br />
== References and Links ==<br />
<br />
<br />
* http://cdn.opensfs.org/wp-content/uploads/2015/04/Lustre-Metrics-New-Techniques-for-Monitoring_Nolin_Wagner.pdf<br />
* Daniel Kobras, "Lustre - Finding the Lustre Filesystem Bottleneck", LAD2012. http://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf<br />
* Florent Thery, "Centralized Lustre Monitoring on Bull Platforms", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/11_Florent_Thery_LAD2013-lustre-bull-monitoring.pdf<br />
* Daniel Rodwell and Patrick Fitzhenry, "Fine-Grained File System Monitoring with Lustre Jobstat", LUG2014. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
* Gabriele Paciucci and Andrew Uselton, "Monitoring the Lustre* file system to maintain optimal performance", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/15_Gabriele_Paciucci_LAD13_Monitoring_05.pdf<br />
* Christopher Morrone, "LMT Lustre Monitoring Tools", LUG2011. http://cdn.opensfs.org/wp-content/uploads/2012/12/400-430_Chris_Morrone_LMT_v2.pdf<br />
* Roland Laifer, "Lustre tools for ldiskfs investigation and lightweight I/O statistics", LAD2015. http://www.eofs.eu/fileadmin/lad2015/slides/13_Roland_Laifer_kit_20150922.pdf<br />
* Lustre and associated scripts used by SSEC - http://www.ssec.wisc.edu/~scottn/files/<br />
<br />
* https://github.com/jhammond/lltop<br />
* https://github.com/chaos/lmt<br />
* https://github.com/chaos/cerebro<br />
* http://graphite.readthedocs.org/en/latest/<br />
* https://mathias-kettner.de/check_mk<br />
* https://github.com/shawn-sterling/graphios<br />
<br />
[[Category: Monitoring|!]]<br />
[[Category: Statistics]]<br />
[[Category: Metrics]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=Lustre_Monitoring_and_Statistics_Guide&diff=2972Lustre Monitoring and Statistics Guide2017-11-06T19:01:59Z<p>Sknolin: /* Open Source Monitoring Packages */</p>
<hr />
<div>== Introduction ==<br />
<br />
This guide is by Scott Nolin (scott.nolin@ssec.wisc.edu), of the University of Wisconsin Space Science and Engineering Center.<br />
<br />
There are a variety of useful statistics and counters available on Lustre servers and clients. This is an attempt to detail some of these statistics and methods for collecting and working with them.<br />
<br />
This does not include Lustre log analysis.<br />
<br />
The presumed audience for this is system administrators attempting to better understand and monitor their Lustre file systems.<br />
<br />
=== Adding to This Guide ===<br />
<br />
If you have improvements, corrections, or more information to share on this topic please contribute to this page. Ideally this would become a community resource.<br />
<br />
== Lustre Versions ==<br />
<br />
This information is based on working primarily with Lustre 2.4 and 2.5.<br />
<br />
== Reading /proc vs lctl ==<br />
<br />
'cat /proc/fs/lustre...' vs 'lctl get_param'<br />
<br />
With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats. This is to insure portability. I will use this method in all examples, a bonus is it can be often be a little shorter syntax. <br />
<br />
== Data Formats ==<br />
Format of the various statistics type files varies (and I'm not sure if there is any reason for this). The format names here are entirely *my invention*, this isn't a standard for Lustre or anything.<br />
<br />
It is useful to know the various formats of these files so you can parse the data and collect for use in other tools. <br />
<br />
=== Stats ===<br />
<br />
What I consider a "standard" stats files include for example each OST or MDT as a multi-line record, and then just the data. <br />
<br />
Example:<br />
<pre><br />
obdfilter.scratch-OST0001.stats=<br />
snapshot_time 1409777887.590578 secs.usecs<br />
read_bytes 27846475 samples [bytes] 4096 1048576 14421705314304<br />
write_bytes 16230483 samples [bytes] 1 1048576 14761109479164<br />
get_info 3735777 samples [reqs]<br />
</pre><br />
<br />
The basic format of each line of the '''stats''' files is:<br />
<br />
{name of statistic} {count of events} samples [{units}]<br />
<br />
Some statistics also contain min/max/''average'' values:<br />
<br />
{name of statistic} {count of events} samples [{units}] {minimum value} {maximum value} {sum of values}<br />
<br />
The average (mean value) value can be computed from {sum of values}/{count of events} since it isn't possible to do floating-point math in the kernel.<br />
<br />
Some statistics also contain ''standard deviation'' data:<br />
<br />
{name of statistic} {count of events} samples [{units}] {minimum value} {maximum value} {sum of values} {sum of value squared}<br />
<br />
The standard deviation can be computed from sqrt({sum of values squared} - {mean value}²).<br />
<br />
snapshot_time = when the stats were written<br />
<p><br />
For read_bytes and write_bytes:<br />
* First number = number of times (samples) the OST has handled a read or write. <br />
* Second number = the minimum read/write size<br />
* Third number = maximum read/write size<br />
* Fourth = sum of all the read/write requests in bytes, the quantity of data read/written.<br />
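<br />
Putting the format described above to work, here is a minimal Python sketch (not part of any Lustre tooling; the function name is just for illustration) that parses one line of a '''stats''' file and derives the mean and standard deviation when the extra fields are present:<br />
<pre><br />
import math<br />
<br />
def parse_stats_line(line):<br />
    """Parse one line of a Lustre 'stats' file into a dict.<br />
<br />
    Handles the layouts described above:<br />
      name count samples [unit]<br />
      name count samples [unit] min max sum<br />
      name count samples [unit] min max sum sumsq<br />
    """<br />
    parts = line.split()<br />
    if len(parts) < 4 or parts[2] != "samples":<br />
        return None                      # e.g. the snapshot_time line<br />
    stat = {"name": parts[0], "count": int(parts[1]), "unit": parts[3].strip("[]")}<br />
    values = [int(v) for v in parts[4:]]<br />
    if len(values) >= 3 and stat["count"] > 0:<br />
        stat["min"], stat["max"], stat["sum"] = values[:3]<br />
        stat["mean"] = stat["sum"] / stat["count"]<br />
        if len(values) == 4:<br />
            stat["sumsq"] = values[3]<br />
            # variance = E[x^2] - (E[x])^2; clamp tiny negatives from rounding<br />
            stat["stddev"] = math.sqrt(max(stat["sumsq"] / stat["count"] - stat["mean"] ** 2, 0))<br />
    return stat<br />
<br />
print(parse_stats_line("read_bytes 27846475 samples [bytes] 4096 1048576 14421705314304"))<br />
</pre><br />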
<br />
=== Jobstats ===<br />
<br />
Jobstats are slightly more complex multi-line records. The output is actually YAML: the (-) blocks mark the start of each job's record, and the { } parts are YAML flow mappings, which is why it resembles JSON. Each OST or MDT has an entry for each jobid (or procname_uid, depending on how job IDs are configured), and then the data. <br />
<br />
Example: <br />
<pre><br />
obdfilter.scratch-OST0000.job_stats=job_stats:<br />
- job_id: 56744<br />
snapshot_time: 1409778251<br />
read: { samples: 18722, unit: bytes, min: 4096, max: 1048576, sum: 17105657856 }<br />
write: { samples: 478, unit: bytes, min: 1238, max: 1048576, sum: 412545938 }<br />
setattr: { samples: 0, unit: reqs }<br />
punch: { samples: 95, unit: reqs }<br />
- job_id: . . . ETC<br />
</pre><br />
<br />
Notice this is very similar to 'stats' above.<br />
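<br />
Because the output is YAML, it can be handed straight to a YAML parser instead of being parsed by hand. A rough sketch, assuming the PyYAML package is installed, that lctl is in the path, and using a placeholder target name:<br />
<pre><br />
import subprocess<br />
import yaml   # PyYAML, not part of the standard library<br />
<br />
# 'scratch-OST0000' is a placeholder target name; substitute one of your OSTs.<br />
raw = subprocess.run(<br />
    ["lctl", "get_param", "-n", "obdfilter.scratch-OST0000.job_stats"],<br />
    capture_output=True, text=True, check=True).stdout<br />
<br />
data = yaml.safe_load(raw) or {}<br />
for job in data.get("job_stats") or []:<br />
    read = job.get("read", {})<br />
    write = job.get("write", {})<br />
    print("job %s: read %d bytes, wrote %d bytes"<br />
          % (job["job_id"], read.get("sum", 0), write.get("sum", 0)))<br />
</pre><br />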
<br />
=== Single ===<br />
<br />
These really boil down to just a single number in a file. But if you use "lctl get_param" you get an output that is nice for parsing. For example: <br />
<pre>[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail<br />
<br />
<br />
osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384<br />
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540<br />
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532<br />
</pre><br />
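<br />
As a small example of working with this format, the following sketch (it only assumes lctl is in the path) totals the available space across all OSTs:<br />
<pre><br />
import subprocess<br />
<br />
out = subprocess.run(["lctl", "get_param", "osd-*.*OST*.kbytesavail"],<br />
                     capture_output=True, text=True, check=True).stdout<br />
<br />
total_kb = 0<br />
for line in out.splitlines():<br />
    if "=" in line:<br />
        name, value = line.split("=", 1)<br />
        total_kb += int(value)<br />
<br />
print("kbytes available across all OSTs: %d" % total_kb)<br />
</pre><br />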
<br />
=== Histogram ===<br />
<br />
Some stats are histograms; these types aren't covered here. Typically they're useful on their own without further parsing.<br />
<br />
<br />
* brw_stats<br />
* extent_stats<br />
<br />
<br />
<br />
== Interesting Statistics Files ==<br />
<br />
This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. For example, you will notice these are mostly server stats; there is a wealth of client stats not detailed here. Additions or corrections are welcome.<br />
<br />
* Host Type = MDS, OSS, client<br />
* Target = "lctl get_param target"<br />
* Format = data format discussed above<br />
<br />
{| class="wikitable"<br />
|-<br />
!Host Type !! Target !! Format !! Discussion<br />
|-<br />
| MDS || mdt.*MDT*.num_exports || single || number of exports per MDT - these are clients, including other lustre servers<br />
|-<br />
| MDS || mdt.*.job_stats || jobstats || Metadata jobstats. Note that with Lustre DNE you may have more than one MDT; even if you don't, it may be wise to design any tools with that assumption.<br />
|-<br />
| OSS || obdfilter.*.job_stats || jobstats || the per OST jobstats. <br />
|-<br />
| MDS || mdt.*.md_stats || stats || Overall metadata stats per MDT<br />
|-<br />
| MDS || mdt.*MDT*.exports.*@*.stats || stats || Per-export metadata stats. The exports subdirectory lists client connections by NID. The exports are named by interfaces, which can be unwieldy. See "lltop" for an example of a script that used this data well. The sum of all the export stats should provide the same data as md_stats, but it is still very convenient to have md_stats; "ltop" uses it, for example.<br />
|-<br />
| OSS || obdfilter.*.stats || stats || Operations per OST. Read and write data is particularly interesting<br />
|-<br />
| OSS || obdfilter.*OST*.exports.*@*.stats || stats || per-export OSS statistics<br />
|-<br />
| MDS || osd-*.*MDT*.filesfree or filestotal || single || available or total inodes<br />
|-<br />
| MDS || osd-*.*MDT*.kbytesfree or kbytestotal || single || available or total disk space<br />
|-<br />
| OSS || obdfilter.*OST*.kbytesfree or kbytestotal, filesfree, filestotal || single || inodes and disk space as in MDS version<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.stats || stats, but unsure of all fields' meaning || Lustre distributed lock manager (ldlm) stats. I do not fully understand these stats or the format. It also appears that these same stats are duplicated as individual single-value files. My understanding of these stats comes from http://wiki.old.lustre.org/doxygen/HEAD/api/html/ldlm__pool_8c_source.html<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.lock_count || single || number of locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.granted || single || lustre distributed lock manager (ldlm) granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_plan || single || ldlm planned number of granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_rate || single || ldlm lock grant rate aka 'GR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.cancel_rate || single || ldlm lock cancel rate aka 'CR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_speed || single || ldlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR'. Or you can just get 'CR' from the stats file I assume.<br />
|}<br />
<br />
== Working With the Data ==<br />
<br />
Packages, tools, and techniques for working with Lustre statistics.<br />
<br />
=== Open Source Monitoring Packages ===<br />
<br />
*LMT - provides 'top' style monitoring of server nodes, and historical data via mysql. https://github.com/chaos/lmt<br />
*lltop and xltop - monitoring with batch scheduler integration. Newer Lustre versions with jobstats likely provide similar data very conveniently, but these are still very good for examples of working with monitoring data. https://github.com/jhammond/lltop https://github.com/jhammond/xltop<br />
<br />
<br />
=== Build it Yourself ===<br />
<br />
Here are basic steps and techniques for working with the Lustre statistics. <br />
<br />
# '''Gather''' the data on hosts you are monitoring. Deal with the syntax, extract what you want<br />
# '''Collect''' the data centrally - either pull or push it to your server, or collection of monitoring servers.<br />
# '''Process''' the data - this may be optional or minimal.<br />
# '''Alert''' on the data - optional but often useful.<br />
# '''Present''' the data - allow for visualization, analysis, etc.<br />
<br />
Some recent tools for working with metrics and time series data have made some of the more difficult parts of this task relatively easy, especially graphical presentation.<br />
<br />
Here are details of some solutions tested or in use:<br />
<br />
==== Ganglia ====<br />
<br />
# Via Collectl<br />
## '''Old collectl method'''<br />
##* collectl - does the '''gather''' by writing to a text file on the host being monitored<br />
##* ganglia does the '''collect''' via gmond and python script 'collectl.py' and '''present''' via ganglia web pages - there is no alerting.<br />
##* See https://wiki.rocksclusters.org/wiki/index.php/Roy_Dragseth#Integrating_collectl_and_ganglia<br />
## Newer '''collectl plugin''' from https://github.com/pcpiela/collectl-lustre - Note there have recently been some changes; after collectl-3.7.3, Lustre support in collectl is moving to plugins: http://sourceforge.net/p/collectl/mailman/message/31992463 <br />
# Via Ganglia python plugin<br />
#* A '''ganglia plugin''' [https://github.com/ganglia/gmond_python_modules gmond python module] for monitoring lustre client is available via [https://github.com/ganglia ganglia github project]<br />
<br />
==== Perl and Graphite ====<br />
<br />
<br />
Graphite is a very convenient tool for storing, working with, and rendering graphs of time-series data. At SSEC we did a quick prototype for collecting and sending MDS and OSS data using perl. The choice of perl is not particularly important; python or the tool of your choice is fine.<br />
<br />
Software Used:<br />
* Graphite and Carbon - http://graphite.readthedocs.org/en/latest/<br />
* http://www.ssec.wisc.edu/~scottn/files/Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* lustrestats scripts - these are simply run every minute via cron on the servers you monitor. For the SSEC prototype we simply sent text data via a TCP socket. The check_mk scripts in the next section have replaced these original test scripts.<br />
* Grafana - http://grafana.org - this is a dashboard and graph editor for graphite. It is not required, as graphite can be used directly, but is very convenient. It allows not just easy creation of dashboards, but also encourages rapid interactive analysis of the data. Note that elasticsearch can be used to store dashboards for grafana, but is not required.<br />
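<br />
For reference, the "send text data via a TCP socket" step mentioned above is simple because Carbon's plaintext protocol accepts one metric per line as ''path value timestamp'' (port 2003 by default). A minimal sketch; the hostname and metric path are placeholders, not our production values:<br />
<pre><br />
import socket<br />
import time<br />
<br />
CARBON_HOST = "graphite.example.com"   # placeholder - your carbon-cache/relay host<br />
CARBON_PORT = 2003                     # Carbon's default plaintext (line) receiver port<br />
<br />
def send_metric(path, value, timestamp=None):<br />
    """Send one metric using Carbon's plaintext protocol: 'path value timestamp'."""<br />
    if timestamp is None:<br />
        timestamp = int(time.time())<br />
    line = "%s %s %d\n" % (path, value, timestamp)<br />
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:<br />
        sock.sendall(line.encode())<br />
<br />
# Example: a write_bytes counter for one OST, under an arbitrary metric namespace.<br />
send_metric("lustre.oss.scratch.oss01.scratch-OST0001.stats.write_bytes", 14761109479164)<br />
</pre><br />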
<br />
==== check_mk and Graphite ====<br />
<br />
Another option, instead of sending directly with perl, is to use a check_mk local agent check.<br />
<br />
The local agent and pnp4nagios mean a reasonable infrastructure is already in place for alerting and also collecting performance data.<br />
<br />
Collecting via perl allowed us to send the timestamp from the Lustre stats (when one exists) directly to Carbon, Graphite's data collection tool. When using the check_mk method this timestamp is lost, so timestamps are based on when the local agent check runs. This introduces some inaccuracy - a delay of up to your sample rate. <br />
<br />
Collecting via both methods allows you to see this difference. This graph shows all the "export" stats summed for each method, with derivative applied to create a rate of change. "CMK" is the check_mk data and "timestamped" was from the perl script. Plotting the raw counter data of course shows very little, but with this derived data you can see the difference.<br />
<br />
This data was sampled once per minute: <br />
<br />
[[File:Timestamp_graphite_jitter.PNG|400px]]<br />
<br />
For our uses at SSEC, this was acceptable. Sampling much more frequently will of course make the error smaller.<br />
<br />
<br />
* Graphite - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* OMD - check_mk, nagios, pnp4nagios<br />
* check_mk local scripts - these are called via check_mk, at whatever rate is desired. http://www.ssec.wisc.edu/~scottn/files/lustre_stats_mds.cmk http://www.ssec.wisc.edu/~scottn/files/lustre_stats_oss.cmk<br />
* graphios https://github.com/shawn-sterling/graphios - a python script to send your nagios performance data to graphite<br />
* Grafana - http://grafana.org - not required, but convenient for dashboards.<br />
<br />
'''Grafana Lustre Dashboard Screenshots:'''<br />
<br />
[[File:Meta-oveview.PNG|200px|Metadata for multiple file systems.]] [[File:Fs-dashboard.PNG|200px|Dashboard for a lustre file system.]]<br />
<br />
==== Logstash, python, and Graphite ====<br />
<br />
Brock Palen discusses this method: http://www.failureasaservice.com/2014/10/lustre-stats-with-graphite-and-logstash.html<br />
<br />
==== Collectd plugin and Graphite ====<br />
<br />
This talk mentions a custom collectd plugin to send stats to graphite:<br />
http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
<br />
Unsure if the source for that plugin is available.<br />
<br />
==== A Note about Jobstats ====<br />
<br />
If using a Whisper or RRD-file based solution, jobstats may not be a great fit. The strength of RRD or Whisper files is that you have a set size for each metric collected. If your metrics are now per-job as opposed to only per-export or per-server, your ''number of metrics'' is now growing without bound.<br />
<br />
Solutions anyone?<br />
<br />
==== Jobstats: Finding jobs doing I/O over watermark ====<br />
<br />
The Perl script show_high_jobstats.pl can be used to collect and filter the current output created by jobstats on Lustre servers. It is useful for checking for jobs doing I/O over a high watermark. This tool was briefly mentioned during Roland Laifer's talk at LAD'15 (see references). You can download it here: <br />
https://www.scc.kit.edu/scc/sw/lustre_tools/show_high_jobstats.tgz<br />
<br />
==== Jobstats: A lightweight solution to provide I/O statistics to users ====<br />
<br />
The second part of Roland Laifer's talk at LAD'15 (see references) described a lightweight solution to provide I/O statistics to users. When a batch job is submitted a user can request statistics for dedicated Lustre file systems. After job completion the batch system writes files which include job ID, file system name, user name and email address. For each file system, a cron job on one server uses these files, collects jobstats from all servers and sends an email with I/O statistics to the user. You can download scripts (which require few modifications) and a detailed description here: <br />
https://www.scc.kit.edu/scc/sw/lustre_tools/jobstats2email.tgz<br />
<br />
Note that array jobs are not well tested and might cause problems.<br />
For example, job IDs might get forged or a single job array could initiate thousands of emails. Therefore, it might be a good idea<br />
to send no emails for array jobs: The batch system could just create no input files if job arrays are used.<br />
<br />
== References and Links ==<br />
<br />
<br />
* http://cdn.opensfs.org/wp-content/uploads/2015/04/Lustre-Metrics-New-Techniques-for-Monitoring_Nolin_Wagner.pdf<br />
* Daniel Kobras, "Lustre - Finding the Lustre Filesystem Bottleneck", LAD2012. http://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf<br />
* Florent Thery, "Centralized Lustre Monitoring on Bull Platforms", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/11_Florent_Thery_LAD2013-lustre-bull-monitoring.pdf<br />
* Daniel Rodwell and Patrick Fitzhenry, "Fine-Grained File System Monitoring with Lustre Jobstat", LUG2014. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
* Gabriele Paciucci and Andrew Uselton, "Monitoring the Lustre* file system to maintain optimal performance", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/15_Gabriele_Paciucci_LAD13_Monitoring_05.pdf<br />
* Christopher Morrone, "LMT Lustre Monitoring Tools", LUG2011. http://cdn.opensfs.org/wp-content/uploads/2012/12/400-430_Chris_Morrone_LMT_v2.pdf<br />
* Roland Laifer, "Lustre tools for ldiskfs investigation and lightweight I/O statistics", LAD2015. http://www.eofs.eu/fileadmin/lad2015/slides/13_Roland_Laifer_kit_20150922.pdf<br />
<br />
<br />
* https://github.com/jhammond/lltop<br />
* https://github.com/chaos/lmt<br />
* https://github.com/chaos/cerebro<br />
* http://graphite.readthedocs.org/en/latest/<br />
* https://mathias-kettner.de/check_mk<br />
* https://github.com/shawn-sterling/graphios<br />
<br />
[[Category: Monitoring|!]]<br />
[[Category: Statistics]]<br />
[[Category: Metrics]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=Check_MK/Graphite/Graphios_Setup_Guide&diff=1716Check MK/Graphite/Graphios Setup Guide2016-05-21T01:08:46Z<p>Sknolin: Andrew not at SSEC, remove invalid email</p>
<hr />
<div>== Introduction ==<br />
This guide will take the user step-by-step through the Lustre Monitoring deployment that the Space Science and Engineering Center uses for monitoring all of its Lustre file systems. The author of this guide is Andrew Wagner. Where possible, I have linked to our production configuration files for software to give readers a good idea of the possible settings they can or should use for their own setups.<br />
<br />
== Hardware Requirements ==<br />
<br />
Any existing server can be used for a proof-of-concept version of this guide. The requirements for several thousand checks per minute are low - a small VM can easily handle the load.<br />
<br />
Our production server can easily handle ~150k checks per minute and, from a processing/disk I/O perspective, could handle much more. Here are the specs:<br />
<br />
*Dell PowerEdge R515<br />
**2x 8-Core AMD Opteron 4386<br />
**300GB RAID1 15K SAS<br />
**200GB Enterprise SSD<br />
**64GB RAM<br />
<br />
== Software Requirements ==<br />
<br />
*Centos 6 x86_64<br />
*Centos 6 EPEL Repository<br />
*Configuration Management System (Puppet, Ansible, Salt, Chef, etc)<br />
<br />
== Notes on Scaling/Size of Metrics ==<br />
Here at SSEC, we are collecting ~200k metrics per minute. The setup that we have could be scaled to a large size with minimal effort. Several million metrics per minute is not out of the question. However, if you are collecting tens of millions of metrics per minute, this approach will likely not work for you.<br />
<br />
== What will the final product look like? ==<br />
<br />
Before embarking on deploying this infrastructure, take a look at some example dashboards that we generated with our Grafana instance. These are not Lustre specific but show some finished products.<br />
<br />
*MDF Switch for SSEC<br />
http://snapshot.raintank.io/dashboard/snapshot/lw0eZBCgUwHtZ2hEtF1At82c6bpl443l<br />
<br />
*Datacenter Coolers<br />
http://snapshot.raintank.io/dashboard/snapshot/WXZG341nFdmWdcoEovHBPCWYmbEeumDv<br />
<br />
*Single Host<br />
http://snapshot.raintank.io/dashboard/snapshot/3SaSSlqEGyO9IjrfGru4V0nebn5TdiaD<br />
<br />
== Building the Lustre Monitoring Deployment ==<br />
<br />
=== Setting up an OMD Monitoring Server ===<br />
<br />
The first thing that we needed for our new monitoring deployment was a monitoring server. We were already using Check_MK with Nagios on our older monitoring server but the Open Monitoring Distribution nicely ties all of the components together. The distribution is available at http://omdistro.org/ and installs via RPM.<br />
<br />
On a newly deployed Centos6 machine, I installed the OMD-1.20 RPM. This takes care of all of the work of installing Nagios, Check_MK, PNP4Nagios, etc.<br />
<br />
After installation, I created the new OMD monitoring site:<br />
<br />
<code>omd create ssec</code><br />
<br />
This creates a new site that runs its own stack of Apache, Nagios, Check_MK and everything else in the OMD distribution. Now we can start the site:<br />
<br />
<code>omd start ssec</code><br />
<br />
You can now navigate to http://example.fqdn.com/sitename on your server, i.e. http://example.ssec.wisc.edu/ssec, and log in with the default OMD credentials.<br />
<br />
We chose to set up LDAPS authentication against our Active Directory server. There is a good discussion of how to do this here:<br />
https://mathias-kettner.de/checkmk_multisite_ldap_integration.html<br />
<br />
Additionally, we set up HTTPS for our web access to OMD:<br />
http://lists.mathias-kettner.de/pipermail/checkmk-en/2014-May/012225.html<br />
<br />
At this point, you can start configuring your monitoring server to monitor hosts! Check_MK has a lot of configuration options, but it's a lot better than managing Nagios configurations by hand. Fortunately, Check_MK is widely used and well documented. The Check_MK documentation root is available at http://mathias-kettner.de/checkmk.html. <br />
<br />
=== Deploying Agents to Lustre Hosts ===<br />
<br />
To operate, the Check_MK agent on hosts runs as an xinetd service with a config file at /etc/xinetd.d/check_mk. That file includes the IP addresses allowed to access the agent in the '''only_from''' parameter. The OMD distribution comes with Check_MK agent RPMs. I rebuilt the RPM using rpmrebuild to include our updated IP addresses for our monitoring servers.<br />
<br />
After rebuilding the RPM, push out the RPM to all hosts that will be monitored. We use a custom repository and Puppet for managing our existing software, so adding the RPM to the repo and pushing out via Puppet can be done with a simple module.<br />
<br />
After deployment, we can verify the agents work by adding them to Check_MK via the GUI or configuration file and inventorying them. This will allow us to monitor a wide array of default metrics such as CPU Load, CPU Utilization, Memory use, and many others.<br />
<br />
=== Writing Local Checks to Run via Check_MK Agent ===<br />
<br />
Now that the Check_MK agents are deployed to the Lustre servers, we can add Check_MK local agent checks to measure whatever we want. The documentation for local checks is here: http://mathias-kettner.de/checkmk_localchecks.html.<br />
<br />
The output of the check has to have a Nagios status number, Name, Performance Data, and Check Output.<br />
<br />
Check out the examples in the Check_MK documentation for formatting of output. You can use whatever language your server supports to execute the local check. At SSEC, Scott Nolin has implemented several Perl scripts to poll Lustre statistics and output in the Check_MK format. You can read more about the checks here:<br />
http://wiki.opensfs.org/Lustre_Statistics_Guide.<br />
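<br />
To make the expected output format concrete, here is a rough sketch of a local check written in Python (the production checks at SSEC are the Perl scripts mentioned above; the service name, parsing, and install path here are illustrative assumptions). Each output line is: status code, service name, performance data, then human-readable text.<br />
<pre><br />
#!/usr/bin/env python3<br />
# Drop this in the agent's local check directory (e.g. /usr/lib/check_mk_agent/local).<br />
import subprocess<br />
<br />
out = subprocess.run(["lctl", "get_param", "obdfilter.*.stats"],<br />
                     capture_output=True, text=True).stdout<br />
<br />
for block in out.split("obdfilter.")[1:]:<br />
    lines = block.splitlines()<br />
    target = lines[0].split(".stats=")[0]              # e.g. scratch-OST0001<br />
    perf = []<br />
    for line in lines[1:]:<br />
        parts = line.split()<br />
        if len(parts) >= 7 and parts[0] in ("read_bytes", "write_bytes"):<br />
            perf.append("%s=%s" % (parts[0], parts[6]))  # running byte counters<br />
    # check_mk local check format: <status> <service name> <perfdata> <text><br />
    print("0 Lustre_%s %s raw byte counters" % (target, "|".join(perf) or "-"))<br />
</pre><br />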
<br />
=== Check_MK RRD Graphs ===<br />
<br />
Once you start collecting this performance data, OMD automatically uses PNP4NAGIOS to create RRD graphs for each collected metric. Check_MK then will display these RRDs in the monitoring interface. This can be useful for small scale testing where you are only collecting a few tens of metrics. However, a thorough stat collection on large Lustre file systems can yield hundreds or even thousands of individual metrics. Check_MK and PNP4NAGIOS are thoroughly outclassed when asked to display such a large number of RRD graphs and respond poorly to high I/O situations.<br />
<br />
Thus, we turn to the Graphite/Carbon metric storage system.<br />
<br />
=== Deploying Graphite/Carbon ===<br />
<br />
The Graphite/Carbon software package collects metrics and stores them in Whisper database files. Graphite is the web frontend and Carbon is the backend that controls the Whisper database files. Whisper files are similar to RRD files in that they have a defined size and fixed constraints on how the file manages time series data as time passes. However, it has many key improvements as described here: http://graphite.readthedocs.org/en/latest/whisper.html<br />
<br />
The installation and basic setup of Graphite and Carbon is pretty easy. We used the version of Graphite found in EPEL.<br />
<br />
<code> yum install graphite-web </code><br />
<br />
This installs both Graphite/Carbon. Graphite is a basic web frontend for visualizing data. The web configuration can be found at /etc/httpd/conf.d/graphite-web.conf. While the Graphite frontend works alright, at SSEC we vastly prefer the usability of Grafana. The next section describes how that frontend is deployed and configured. <br />
<br />
There are three Carbon services that need to be set to run on startup:<br />
<br />
*carbon-aggregator<br />
*carbon-cache<br />
*carbon-relay<br />
<br />
The Carbon configuration files can be found at /etc/carbon. Below, I've linked to our settings for the various Carbon configuration files. I don't attest to the correctness of these settings, but if you have no idea where to start, these will at least get you up and running!<br />
<br />
*http://www.ssec.wisc.edu/~andreww/files/carbon.conf<br />
*http://www.ssec.wisc.edu/~andreww/files/storage-aggregation.conf<br />
*http://www.ssec.wisc.edu/~andreww/files/storage-schemas.conf<br />
<br />
Once Carbon is running, you can actually use the Graphite/Carbon installation if you don't want to have dashboards and such. Graphite is well documented and you can read more about the software here: http://graphite.readthedocs.org/en/latest/<br />
<br />
==== Graphite Metric Namespace ====<br />
<br />
Creating an appropriate namespace for Graphite metrics is difficult. We went through a dozen iterations at SSEC before arriving at one that is now largely satisfactory. The Graphite namespace refers to how you organize your metrics in the Graphite/Carbon system and how you will access them later.<br />
<br />
Below is an example namespace for an SSEC Lustre OSS in our Delta filesystem:<br />
<br />
<code>lustre.oss.delta.delta-1-21.delta-OST0010.stats.write_bytes</code><br />
<br />
The above namespace describes the bytes written to OST0010 on the delta-1-21 server, under the lustre.oss category. You can almost think of these as paths to the Whisper files in which Graphite stores metrics. Each field between the periods is mutable. For example, I could change write_bytes to read_bytes to get that metric for OST0010 on delta-1-21, or change delta-1-21 to delta-3-11 and pick a different OST entirely. Each of those fields can be named anything logical and then accessed with Graphite or Grafana to create graphs of what you are interested in visualizing. <br />
<br />
Here's another example:<br />
<br />
<code>servers.iris.compute.r720-0-5.mem.buffers</code><br />
<br />
The above namespace links to the memory buffer metrics for a server named r720-0-5 of the compute type in the SSEC HPC cluster Iris. Look at the chart below to get the best idea of how the namespace works within Graphite in practice. One of the great things about Graphite is that you can use wildcards to match all metrics in a given namespace field. See how you might use that in practice below:<br />
<br />
{| class="wikitable"<br />
|-<br />
!servers!!iris!!compute!!r720-0-5!!mem!!buffers<br />
|-<br />
|servers||iris||compute||r720-0-5||cpu_usage||user<br />
|-<br />
|servers||iris||compute||*||cpu_usage||user<br />
|-<br />
|servers||iris||compute||*||cpu_usage||*<br />
|}<br />
In the above namespace examples, I used wildcards to select all metrics in the given field. These selection methods can be combined with mathematical functions to create powerful graphs. For example, the last entry in the chart would display all of the cpu_usage metrics for the compute nodes in the iris group of servers. <br />
<br />
{| class="wikitable"<br />
|-<br />
!lustre!!oss!!delta!!delta-1-21.delta!!delta-OST0010!!stats!!write_bytes<br />
|-<br />
|lustre||oss||delta||delta-1-21.delta||delta-OST0010||exportstats||172.1.1.25@o2ib||write_bytes<br />
|-<br />
|lustre||oss||delta||*||*||exportstats||172.1.1.25@o2ib||write_bytes<br />
|-<br />
|lustre||oss||delta||*||*||stats||write_bytes<br />
|}<br />
<br />
As in the above example, these Graphite namespaces show the power of wildcards. In this case, the last entry in the chart would display the write_bytes rate of all the OSTs in the delta filesystem. Graphite's built-in mathematical functions could sum the metrics to create a graph of the write rate of the entire filesystem.<br />
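<br />
As an illustration of that last point, the Graphite render API can apply functions such as sumSeries() and nonNegativeDerivative() to a wildcard target and return JSON. Here is a sketch using Python's standard library; the hostname is a placeholder and the metric path follows the example namespace above:<br />
<pre><br />
import json<br />
import urllib.parse<br />
import urllib.request<br />
<br />
# Sum write_bytes across every delta OST, then turn the raw counter into a rate.<br />
target = "nonNegativeDerivative(sumSeries(lustre.oss.delta.*.*.stats.write_bytes))"<br />
url = "http://graphite.example.com/render?" + urllib.parse.urlencode(<br />
    {"target": target, "from": "-1h", "format": "json"})<br />
<br />
with urllib.request.urlopen(url) as resp:<br />
    series = json.load(resp)<br />
<br />
for s in series:<br />
    # datapoints are [value, timestamp] pairs; value is None where no data exists<br />
    latest = next((v for v, t in reversed(s["datapoints"]) if v is not None), None)<br />
    print(s["target"], "latest:", latest)<br />
</pre><br />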
<br />
=== Deploying Grafana ===<br />
<br />
Grafana is the dashboard that SSEC prefers to use for data visualization. <br />
<br />
* To read about Grafana, check out this link: http://docs.grafana.org/<br />
* To try a test install of Grafana to get a feel for use, go here: http://play.grafana.org<br />
* To install Grafana, we used the RPM available here: http://docs.grafana.org/installation/rpm/<br />
<br />
Building dashboards via the grafana GUI is easy, and it quickly becomes the 'analyst tool of choice' for understanding data. These dashboards will serve for many needs.<br />
<br />
When these types of dashboards are not enough, or the workflow makes them too tedious to build, you can create *scripted dashboards* in grafana. These are javascript programs and require some coding, so they are more work to create - but potentially very powerful.<br />
<br />
=== Using Graphios to Redirect Lustre Stats to Carbon ===<br />
<br />
Given our decision to stick with a single monitoring infrastructure, we needed a way to direct the performance data coming in via Nagios into Graphite. Graphios takes a Nagios performance data file and parses it for a special set of data marked via special prefixes/postfixes. Graphios then pipes this data to the Carbon ingest port for storage and access via Graphite.<br />
<br />
The Graphios Github page has lots of installation and configuration examples, some of which I've contributed. Detailed setup instructions for OMD are contained within the readme at https://github.com/shawn-sterling/graphios.<br />
<br />
[[Category:Monitoring]] [[Category:Metrics]] [[Category:Tools]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=ZFS_JBOD_Monitoring&diff=1586ZFS JBOD Monitoring2016-04-13T00:28:41Z<p>Sknolin: Add note that zpool status checks are always useful.</p>
<hr />
<div>If using ZFS software RAID (RAIDZ2, for example) to provide Lustre OSTs, monitoring disk and enclosure health can be a challenge. This is because vendor disk array monitoring is typically included as part of a package with RAID controllers.<br />
<br />
If you are aware of any vendor-supported monitoring solutions for this or have your own solution, please add to this page.<br />
<br />
<br />
== Disk Failure: zpool status ==<br />
<br />
To detect disk failure, simply check the zpool status.<br />
<br />
This is useful for any zfs filesystem, even those built on traditional RAID.<br />
<br />
=== Standalone Example ===<br />
<br />
To perform checks, standalone scripts can be written around the zpool status command. An example follows; it uses the ldev.conf file to find the pool names associated with Lustre, and mutt to send the email:<br />
<br />
<pre><br />
#!/bin/bash<br />
#<br />
# zfs monitoring script for lustre with zfs backend<br />
# uses /etc/ldev.conf to locate zpools, then zpool status to find degraded pools.<br />
<br />
HELP="<br />
This script uses /etc/ldev.conf and zpool status to identify mounted pools, then sends an email if <br />
a pool returns a status other than ONLINE<br />
"<br />
<br />
LDEV_FILE="/etc/ldev.conf" <br />
EMAIL="admin@place.org"<br />
<br />
send_email ()<br />
{<br />
/usr/bin/mutt -s "zpool status warning on $HOSTNAME" $EMAIL<< EOF<br />
"$1"<br />
EOF<br />
}<br />
<br />
if [ ! -f $LDEV_FILE ]<br />
then<br />
/usr/bin/mutt -s "WARNING, no ldev file found on $HOSTNAME" $EMAIL<br />
exit<br />
fi<br />
<br />
for POOL in `cat $LDEV_FILE`<br />
do<br />
if [[ `echo $POOL | grep ost` ]]<br />
then<br />
POOL_NAME=`echo $POOL | cut -f2 -d":" | cut -f1 -d"/"`<br />
POOL_STATUS=`/sbin/zpool status $POOL_NAME`<br />
<br />
#check for errors running zpool<br />
if [ $? -ne 0 ]<br />
then<br />
send_email "$POOL_STATUS" <br />
fi<br />
<br />
#get pool state <br />
POOL_STATE=`echo "$POOL_STATUS" | grep state`<br />
if [[ $POOL_STATE != *ONLINE* ]]<br />
then<br />
send_email "$POOL_STATUS"<br />
fi<br />
fi<br />
done<br />
</pre><br />
<br />
The script above can then be run as a cron job, with the below as an example of that cron job.<br />
<br />
<pre><br />
#zpool monitoring crontab for oss systems<br />
SHELL=/bin/bash<br />
PATH=/sbin:/bin:/usr/sbin:/usr/bin<br />
HOME=/tmp<br />
50 06 * * * root /usr/local/zfs_monitor/zfs_mon.sh > /dev/null 2>&1; exit 0<br />
</pre><br />
<br />
=== Check_mk Example ===<br />
<br />
This is an example check_mk script; nagios or other agent-based monitoring systems will be similar. <br />
<br />
<pre><br />
#!/bin/bash<br />
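# check_mk local check: summarise zpool scrub status for all pools as one service.<br />
# Output uses the check_mk "local" format: <status> <name> <perfdata> <text>,<br />
# where 0=OK, 1=WARN, 2=CRIT and '-' means no performance data is reported.<br />
# A pool warns if it has never been scrubbed, goes critical if the last scrub<br />
# is older than 35 days, and is OK while a scrub or resilver is running.<br />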
currentDate=$(date +"%y%m%d")<br />
zfsVols=$(/sbin/zpool list -H -o name)<br />
<br />
if [ "$zfsVols" == "" ]; then<br />
exit<br />
fi<br />
<br />
for volume in ${zfsVols}<br />
do<br />
if [ $(/sbin/zpool status $volume | egrep -c "none requested") -ge 1 ]; then<br />
status=1<br />
statustxt="$volume needs to initial scrub; $statustxt"<br />
fi<br />
if [ $(/sbin/zpool status $volume | egrep -c "scrub in progress|resilver") -ge 1 ]; then<br />
status=0<br />
statustxt="$volume scrub in progress; $statustxt"<br />
fi<br />
<br />
scrubReportedDate=$(/sbin/zpool status $volume | grep scrub | cut -d' ' -f13-)<br />
scrubDate=$(date -d "$scrubReportedDate + 35 days" +"%y%m%d")<br />
if [ $currentDate -ge $scrubDate ]; then<br />
status=2<br />
statustxt="$volume scrub out of date; $statustxt"<br />
fi<br />
if [ $scrubDate -ge $currentDate ]; then<br />
if [[ $status != 1 && $status != 2 ]]; then<br />
status=0<br />
statustxt="$volume up to date; $statustxt"<br />
fi<br />
fi<br />
done<br />
<br />
echo "$status ZPOOL_SCRUB_STATUS - $statustxt"<br />
</pre><br />
<br />
== Predictive Failure: smartctl ==<br />
<br />
To monitor for predictive drive failure, you can use 'smartctl', provided by the 'smartmontools' package on CentOS.<br />
<br />
Example check_mk script:<br />
<pre><br />
#!/bin/bash<br />
#<br />
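# check_mk local check: one service per disk based on SMART health status.<br />
# Assumes /dev/disk/by-vdev aliases exist (e.g. created from a ZFS vdev_id.conf)<br />
# and SAS drives, where 'smartctl -H' reports 'SMART Health Status: OK';<br />
# SATA drives word that line differently, so the field parsing would need changes.<br />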
<br />
DISKS="$(/bin/ls /dev/disk/by-vdev| /bin/grep -v part)"<br />
UNHEALTHY_COUNT=0<br />
<br />
for DISK in ${DISKS}<br />
do<br />
HEALTH=`smartctl -H /dev/disk/by-vdev/${DISK} | grep SMART`<br />
HEALTHSTATUS=`echo ${HEALTH} | cut -d ' ' -f 4`<br />
if [[ $HEALTHSTATUS != "OK" ]]; then<br />
status=2<br />
else<br />
status=0<br />
fi<br />
echo "$status SMART_Status_${DISK} - ${DISK} ${HEALTH}"<br />
<br />
done<br />
</pre><br />
<br />
== Enclosure Monitoring ==<br />
<br />
This information applies to Linux systems monitoring Dell MD1200 disk arrays directly attached via SAS, with no RAID card.<br />
While the above techniques tell you if you have a disk problem, you still need to monitor the status of the arrays themselves. For our particular setup - MD1200 disk arrays attached via SAS - sg3_utils and [https://en.wikipedia.org/wiki/SCSI_Enclosure_Services SCSI Enclosure Services] (sg_ses) are the best answer so far.<br />
<br />
At SSEC to monitor our enclosures we use this script: http://www.ssec.wisc.edu/~scottn/Lustre_ZFS_notes/script/check_md1200.pl<br />
<br />
[[Category:Monitoring]][[Category:ZFS]][[Category:Howto]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=ZFS_JBOD_Monitoring&diff=1585ZFS JBOD Monitoring2016-04-13T00:26:32Z<p>Sknolin: Add another example, organize subheadings</p>
<hr />
<div>If using ZFS software raid (RAIDZ2 for example) to provide Lustre OST's, monitoring disk and enclosure health can be a challenge. This is because typically vendor disk array monitoring is included as part of a package with RAID controllers.<br />
<br />
If you are aware of any vendor-supported monitoring solutions for this or have your own solution, please add to this page.<br />
<br />
<br />
<br />
== Disk Failure: zpool status ==<br />
<br />
To detect disk failure, simply check the zpool status.<br />
<br />
=== Standalone Example ===<br />
<br />
To perform checks standalone scripts can be written to use the zpool status command. An example is as follows, which uses the ldev.conf file to know pool names associated with lustre, and mutt to send the email:<br />
<br />
<pre><br />
#!/bin/bash<br />
#<br />
# zfs monitoring script for lustre with zfs backend<br />
# uses /etc/ldev.conf to locate zpools, then zpool status to find degraded pools.<br />
<br />
HELP="<br />
This script uses /etc/ldev.conf and zpool status to identify mounted pools, then sends an email if <br />
a pool returns a status other than ONLINE<br />
"<br />
<br />
LDEV_FILE="/etc/ldev.conf" <br />
EMAIL="admin@place.org"<br />
<br />
send_email ()<br />
{<br />
/usr/bin/mutt -s "zpool status warning on $HOSTNAME" $EMAIL<< EOF<br />
"$1"<br />
EOF<br />
}<br />
<br />
if [ ! -f $LDEV_FILE ]<br />
then<br />
/usr/bin/mutt -s "WARNING, no ldev file found on $HOSTNAME" $EMAIL<br />
exit<br />
fi<br />
<br />
for POOL in `cat $LDEV_FILE`<br />
do<br />
if [[ `echo $POOL | grep ost` ]]<br />
then<br />
POOL_NAME=`echo $POOL | cut -f2 -d":" | cut -f1 -d"/"`<br />
POOL_STATUS=`/sbin/zpool status $POOL_NAME`<br />
<br />
#check for errors running zpool<br />
if [ $? -ne 0 ]<br />
then<br />
send_email "$POOL_STATUS" <br />
fi<br />
<br />
#get pool state <br />
POOL_STATE=`echo "$POOL_STATUS" | grep state`<br />
if [[ $POOL_STATE != *ONLINE* ]]<br />
then<br />
send_email "$POOL_STATUS"<br />
fi<br />
fi<br />
done<br />
</pre><br />
<br />
The script above can then be run as a cron job, with the below as an example of that cron job.<br />
<br />
<pre><br />
#zpool monitoring crontab for oss systems<br />
SHELL=/bin/bash<br />
PATH=/sbin:/bin:/usr/sbin:/usr/bin<br />
HOME=/tmp<br />
50 06 * * * root /usr/local/zfs_monitor/zfs_mon.sh > /dev/null 2>&1; exit 0<br />
</pre><br />
<br />
=== Check_mk Example ===<br />
<br />
This is an example check_mk script, nagios or other agent-based monitoring systems will be similar. <br />
<br />
<pre><br />
#!/bin/bash<br />
currentDate=$(date +"%y%m%d")<br />
zfsVols=$(/sbin/zpool list -H -o name)<br />
<br />
if [ "$zfsVols" == "" ]; then<br />
exit<br />
fi<br />
<br />
for volume in ${zfsVols}<br />
do<br />
if [ $(/sbin/zpool status $volume | egrep -c "none requested") -ge 1 ]; then<br />
status=1<br />
statustxt="$volume needs to initial scrub; $statustxt"<br />
fi<br />
if [ $(/sbin/zpool status $volume | egrep -c "scrub in progress|resilver") -ge 1 ]; then<br />
status=0<br />
statustxt="$volume scrub in progress; $statustxt"<br />
fi<br />
<br />
scrubReportedDate=$(/sbin/zpool status $volume | grep scrub | cut -d' ' -f13-)<br />
scrubDate=$(date -d "$scrubReportedDate + 35 days" +"%y%m%d")<br />
if [ $currentDate -ge $scrubDate ]; then<br />
status=2<br />
statustxt="$volume scrub out of date; $statustxt"<br />
fi<br />
if [ $scrubDate -ge $currentDate ]; then<br />
if [[ $status != 1 && $status != 2 ]]; then<br />
status=0<br />
statustxt="$volume up to date; $statustxt"<br />
fi<br />
fi<br />
done<br />
<br />
echo "$status ZPOOL_SCRUB_STATUS - $statustxt"<br />
</pre><br />
<br />
== Predictive Failure: smartctl ==<br />
<br />
To monitor predictive drive failure, you can use 'smartctl' provided by the 'smartmontools' package for centos.<br />
<br />
Example check_mk script:<br />
<pre><br />
#!/bin/bash<br />
#<br />
<br />
DISKS="$(/bin/ls /dev/disk/by-vdev| /bin/grep -v part)"<br />
UNHEALTHY_COUNT=0<br />
<br />
for DISK in ${DISKS}<br />
do<br />
HEALTH=`smartctl -H /dev/disk/by-vdev/${DISK} | grep SMART`<br />
HEALTHSTATUS=`echo ${HEALTH} | cut -d ' ' -f 4`<br />
if [[ $HEALTHSTATUS != "OK" ]]; then<br />
status=2<br />
else<br />
status=0<br />
fi<br />
echo "$status SMART_Status_${DISK} - ${DISK} ${HEALTH}"<br />
<br />
done<br />
</pre><br />
<br />
== Enclosure Monitoring ==<br />
<br />
This information is used for linux systems and monitoring Dell MD1200 disk arrays directly attached via SAS, with no RAID card.<br />
While the above techniques tell you if you have a disk problem, you still need to monitor the status of the arrays themselves. For our particular problem this is MD1200 disk arrays via SAS. For us, sg3_utils and [https://en.wikipedia.org/wiki/SCSI_Enclosure_Services SCSI Enclosure Services] sg_ses is the best answer so far.<br />
<br />
At SSEC to monitor our enclosures we use this script: http://www.ssec.wisc.edu/~scottn/Lustre_ZFS_notes/script/check_md1200.pl<br />
<br />
[[Category:Monitoring]][[Category:ZFS]][[Category:Howto]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=ZFS_JBOD_Monitoring&diff=1584ZFS JBOD Monitoring2016-04-13T00:19:10Z<p>Sknolin: </p>
<hr />
<div>If using ZFS software raid (RAIDZ2 for example) to provide Lustre OST's, monitoring disk and enclosure health can be a challenge. This is because typically vendor disk array monitoring is included as part of a package with RAID controllers.<br />
<br />
If you are aware of any vendor-supported monitoring solutions for this or have your own solution, please add to this page.<br />
<br />
<br />
<br />
== Disk Failure: zpool status ==<br />
<br />
To detect disk failure, simply check the zpool status. There are various scripts to do this for nagios/check_mk.<br />
<br />
To perform checks outside of nagios scripts can be written to use the zpool status command. An example is as follows, which uses the ldev.conf file to know pool names associated with lustre, and mutt to send the email:<br />
<br />
<pre><br />
#!/bin/bash<br />
#<br />
# zfs monitoring script for lustre with zfs backend<br />
# uses /etc/ldev.conf to locate zpools, then zpool status to find degraded pools.<br />
<br />
HELP="<br />
This script uses /etc/ldev.conf and zpool status to identify mounted pools, then sends an email if <br />
a pool returns a status other than ONLINE<br />
"<br />
<br />
LDEV_FILE="/etc/ldev.conf" <br />
EMAIL="admin@place.org"<br />
<br />
send_email ()<br />
{<br />
/usr/bin/mutt -s "zpool status warning on $HOSTNAME" $EMAIL<< EOF<br />
"$1"<br />
EOF<br />
}<br />
<br />
if [ ! -f $LDEV_FILE ]<br />
then<br />
/usr/bin/mutt -s "WARNING, no ldev file found on $HOSTNAME" $EMAIL<br />
exit<br />
fi<br />
<br />
for POOL in `cat $LDEV_FILE`<br />
do<br />
if [[ `echo $POOL | grep ost` ]]<br />
then<br />
POOL_NAME=`echo $POOL | cut -f2 -d":" | cut -f1 -d"/"`<br />
POOL_STATUS=`/sbin/zpool status $POOL_NAME`<br />
<br />
#check for errors running zpool<br />
if [ $? -ne 0 ]<br />
then<br />
send_email "$POOL_STATUS" <br />
fi<br />
<br />
#get pool state <br />
POOL_STATE=`echo "$POOL_STATUS" | grep state`<br />
if [[ $POOL_STATE != *ONLINE* ]]<br />
then<br />
send_email "$POOL_STATUS"<br />
fi<br />
fi<br />
<br />
done<br />
</pre><br />
<br />
The script above can then be run as a cron job, with the below as an example of that cron job.<br />
<br />
<pre><br />
#zpool monitoring crontab for oss systems<br />
SHELL=/bin/bash<br />
PATH=/sbin:/bin:/usr/sbin:/usr/bin<br />
HOME=/tmp<br />
50 06 * * * root /usr/local/zfs_monitor/zfs_mon.sh > /dev/null 2>&1; exit 0<br />
</pre><br />
<br />
== Predictive Failure: smartctl ==<br />
<br />
To monitor predictive drive failure, you can use 'smartctl' provided by the 'smartmontools' package for centos.<br />
<br />
Example check_mk script:<br />
<pre><br />
#!/bin/bash<br />
#<br />
<br />
DISKS="$(/bin/ls /dev/disk/by-vdev| /bin/grep -v part)"<br />
UNHEALTHY_COUNT=0<br />
<br />
for DISK in ${DISKS}<br />
do<br />
HEALTH=`smartctl -H /dev/disk/by-vdev/${DISK} | grep SMART`<br />
HEALTHSTATUS=`echo ${HEALTH} | cut -d ' ' -f 4`<br />
if [[ $HEALTHSTATUS != "OK" ]]; then<br />
status=2<br />
else<br />
status=0<br />
fi<br />
echo "$status SMART_Status_${DISK} - ${DISK} ${HEALTH}"<br />
<br />
done<br />
</pre><br />
<br />
== Enclosure Monitoring ==<br />
<br />
This information is used for linux systems and monitoring Dell MD1200 disk arrays directly attached via SAS, with no RAID card.<br />
While the above techniques tell you if you have a disk problem, you still need to monitor the status of the arrays themselves. For our particular problem this is MD1200 disk arrays via SAS. For us, sg3_utils and [https://en.wikipedia.org/wiki/SCSI_Enclosure_Services SCSI Enclosure Services] sg_ses is the best answer so far.<br />
<br />
At SSEC to monitor our enclosures we use this script: http://www.ssec.wisc.edu/~scottn/Lustre_ZFS_notes/script/check_md1200.pl<br />
<br />
[[Category:Monitoring]][[Category:ZFS]][[Category:Howto]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=Lustre_Monitoring_and_Statistics_Guide&diff=1114Lustre Monitoring and Statistics Guide2016-01-13T21:59:08Z<p>Sknolin: typo fix</p>
<hr />
<div>== Introduction ==<br />
<br />
This guide is by Scott Nolin (scott.nolin@ssec.wisc.edu), of the University of Wisconsin Space Science and Engineering Center.<br />
<br />
There are a variety of useful statistics and counters available on Lustre servers and clients. This is an attempt to detail some of these statistics and methods for collecting and working with them.<br />
<br />
This does not include Lustre log analysis.<br />
<br />
The presumed audience for this is system administrators attempting to better understand and monitor their Lustre file systems.<br />
<br />
=== Adding to This Guide ===<br />
<br />
If you have improvements, corrections, or more information to share on this topic please contribute to this page. Ideally this would become a community resource.<br />
<br />
== Lustre Versions ==<br />
<br />
This information is based on working primarily with Lustre 2.4 and 2.5.<br />
<br />
== Reading /proc vs lctl ==<br />
<br />
'cat /proc/fs/lustre...' vs 'lctl get_param'<br />
<br />
With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats. This is to ensure portability. I will use this method in all examples; a bonus is that the syntax is often a little shorter. <br />
<br />
== Data Formats ==<br />
Format of the various statistics type files varies (and I'm not sure if there is any reason for this). The format names here are entirely *my invention*, this isn't a standard for Lustre or anything.<br />
<br />
It is useful to know the various formats of these files so you can parse the data and collect for use in other tools. <br />
<br />
=== Stats ===<br />
<br />
What I consider a "standard" stats files include for example each OST or MDT as a multi-line record, and then just the data. <br />
<br />
Example:<br />
<pre><br />
obdfilter.scratch-OST0001.stats=<br />
snapshot_time 1409777887.590578 secs.usecs<br />
read_bytes 27846475 samples [bytes] 4096 1048576 14421705314304<br />
write_bytes 16230483 samples [bytes] 1 1048576 14761109479164<br />
get_info 3735777 samples [reqs]<br />
</pre><br />
<br />
The basic format of each line of the '''stats''' files is:<br />
<br />
{name of statistic} {count of events} samples [{units}]<br />
<br />
Some statistics also contain min/max/''average'' values:<br />
<br />
{name of statistic} {count of events} samples [{units}] {minimum value} {maximum value} {sum of values}<br />
<br />
The average (mean) value can be computed as {sum of values}/{count of events}, since it isn't possible to do floating-point math in the kernel.<br />
<br />
Some statistics also contain ''standard deviation'' data:<br />
<br />
{name of statistic} {count of events} samples [{units}] {minimum value} {maximum value} {sum of values} {sum of value squared}<br />
<br />
The standard deviation can be computed as sqrt({sum of values squared}/{count of events} - {mean value}²).<br />
<br />
snapshot_time = when the stats were written<br />
<p><br />
For read_bytes and write_bytes:<br />
* First number = number of times (samples) the OST has handled a read or write. <br />
* Second number = the minimum read/write size<br />
* Third number = maximum read/write size<br />
* Fourth = sum of all the read/write requests in bytes, the quantity of data read/written.<br />
<br />
=== Jobstats ===<br />
<br />
Jobstats are slightly more complex multi-line records. They look like JSON, except for the (-) blocks for each job. Each OST or MDT also has an entry for each jobid (or procname_uid perhaps), and then the data. <br />
<br />
Example: <br />
<pre><br />
obdfilter.scratch-OST0000.job_stats=job_stats:<br />
- job_id: 56744<br />
snapshot_time: 1409778251<br />
read: { samples: 18722, unit: bytes, min: 4096, max: 1048576, sum: 17105657856 }<br />
write: { samples: 478, unit: bytes, min: 1238, max: 1048576, sum: 412545938 }<br />
setattr: { samples: 0, unit: reqs } punch: { samples: 95, unit: reqs }<br />
- job_id: . . . ETC<br />
</pre><br />
<br />
Notice this is very similar to 'stats' above.<br />
<br />
=== Single ===<br />
<br />
These really boil down to just a single number in a file. But if you use "lctl get_param" you get an output that is nice for parsing. For example: <br />
<pre>[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail<br />
<br />
<br />
osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384<br />
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540<br />
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532<br />
</pre><br />
<br />
=== Histogram ===<br />
<br />
Some stats are histograms, these types aren't covered here. Typically they're useful on their own without further parsing(?)<br />
<br />
<br />
* brw_stats<br />
* extent_stats<br />
<br />
<br />
<br />
== Interesting Statistics Files ==<br />
<br />
This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. For example, you will notice these are mostly server stats; there is a wealth of client stats not detailed here. Additions or corrections are welcome.<br />
<br />
* Host Type = MDS, OSS, client<br />
* Target = "lctl get_param target"<br />
* Format = data format discussed above<br />
<br />
{| class="wikitable"<br />
|-<br />
!Host Type !! Target !! Format !! Discussion<br />
|-<br />
| MDS || mdt.*MDT*.num_exports || single || number of exports per MDT - these are clients, including other lustre servers<br />
|-<br />
| MDS || mdt.*.job_stats || jobstats || Metadata jobstats. Note that with lustre DNE you may have more than one MDT, so even if you don't it may be wise to design any tools with that assumption.<br />
|-<br />
| OSS || obdfilter.*.job_stats || jobstats || the per OST jobstats. <br />
|-<br />
| MDS || mdt.*.md_stats || stats || Overall metadata stats per MDT<br />
|-<br />
| MDS || mdt.*MDT*.exports.*@*.stats || stats || Per-export metadata stats. The exports subdirectory lists client connections by NID. The exports are named by interfaces, which can be unwieldy. See "lltop" for an example of a script that used this data well. The sum of all the export stats should provide the same data as md_stats, but it is still very convenient to have md_stats; "ltop" uses it, for example.<br />
|-<br />
| OSS || obdfilter.*.stats || stats || Operations per OST. Read and write data is particularly interesting<br />
|-<br />
| OSS || obdfilter.*OST*.exports.*@*.stats || stats || per-export OSS statistics<br />
|-<br />
| MDS || osd-*.*MDT*.filesfree or filestotal || single || available or total inodes<br />
|-<br />
| MDS || osd-*.*MDT*.kbytesfree or kbytestotal || single || available or total disk space<br />
|-<br />
| OSS || obdfilter.*OST*.kbytesfree or kbytestotal, filesfree, filestotal || single || inodes and disk space as in MDS version<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.stats || stats? but unsure of all fields meaning || lustre distributed lock manager (ldlm) stats. I do not fully understand these stats or the format. It also appears that these same stats are duplicated a single stats. My understanding of these stats comes from http://wiki.old.lustre.org/doxygen/HEAD/api/html/ldlm__pool_8c_source.html<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.lock_count || single || number of locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.granted || single || lustre distributed lock manager (ldlm) granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_plan || single || ldlm planned number of granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_rate || single || ldlm lock grant rate aka 'GR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.cancel_rate || single || ldlm lock cancel rate aka 'CR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_speed || single || ldlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR'. Or you can just get 'CR' from the stats file I assume.<br />
|}<br />
<br />
== Working With the Data ==<br />
<br />
Packages, tools, and techniques for working with Lustre statistics.<br />
<br />
=== Open Source Monitoring Packages ===<br />
<br />
*LMT - provides 'top' style monitoring of server nodes, and historical data via mysql. https://github.com/chaos/lmt<br />
*lltop and xltop - monitoring with batch scheduler integration. Newer Lustre versions with jobstats likely provide similar data very conveniently, but these are still very good for examples of working with monitoring data. https://github.com/jhammond/lltop https://github.com/jhammond/xltop<br />
<br />
<br />
=== Build it Yourself ===<br />
<br />
Here are basic steps and techniques for working with the Lustre statistics. <br />
<br />
# '''Gather''' the data on hosts you are monitoring. Deal with the syntax, extract what you want<br />
# '''Collect''' the data centrally - either pull or push it to your server, or collection of monitoring servers.<br />
# '''Process''' the data - this may be optional or minimal.<br />
# '''Alert''' on the data - optional but often useful.<br />
# '''Present''' the data - allow for visualization, analysis, etc.<br />
<br />
Some recent tools for working with metrics and time series data have made some of the more difficult parts of this task relatively easy, especially graphical presentation.<br />
<br />
Here are details of some solutions tested or in use:<br />
<br />
==== Ganglia ====<br />
<br />
# Via Collectl<br />
## '''Old collectl method'''<br />
##* collectl - does the '''gather''' by writing to a text file on the host being monitored<br />
##* ganglia does the '''collect''' via gmond and python script 'collectl.py' and '''present''' via ganglia web pages - there is no alerting.<br />
##* See https://wiki.rocksclusters.org/wiki/index.php/Roy_Dragseth#Integrating_collectl_and_ganglia<br />
## Newer '''collectl plugin''' from https://github.com/pcpiela/collectl-lustre - Note there have recently been some changes, after collectl-3.7.3 Lustre support in collectl is moving to plugins: http://sourceforge.net/p/collectl/mailman/message/31992463 <br />
# Via Ganglia python plugin<br />
#* A '''ganglia plugin''' [https://github.com/ganglia/gmond_python_modules gmond python module] for monitoring lustre client is available via [https://github.com/ganglia ganglia github project]<br />
<br />
==== Perl and Graphite ====<br />
<br />
<br />
Graphite is a very convenient tool for storing, working with, and rendering graphs of time-series data. At SSEC we did a quick prototype for collecting and sending MDS and OSS data using perl. The choice of perl is not particularly important, python or the tool of your choice is fine.<br />
<br />
Software Used:<br />
* Graphite and Carbon - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* lustrestats scripts - these are simply run every minute via cron on the servers you monitor. For the SSEC prototype we simply sent text data via a TCP socket. The check_mk scripts in the next section have replaced these original test scripts.<br />
* Grafana - http://grafana.org - this is a dashboard and graph editor for graphite. It is not required, as graphite can be used directly, but is very convenient. It allows not just easy creation of dashboards, but also encourages rapid interactive analysis of the data. Note that elasticsearch can be used to store dashboards for grafana, but is not required.<br />
<br />
==== check_mk and Graphite ====<br />
<br />
Another option is instead of directly sending with perl, use a check_mk local agent check.<br />
<br />
The local agent and pnp4nagios mean a reasonable infrastructure is already in place for alerting and also collecting performance data.<br />
<br />
Collecting via perl allowed us to send the timestamp from the Lustre stats (when one exists) directly to Carbon, Graphite's data collection tool. When using the check_mk method this timestamp is lost, so timestamps are based on when the local agent check runs. This introduces some inaccuracy - a delay of up to your sample rate. <br />
<br />
Collecting via both methods allows you to see this difference. This graph shows all the "export" stats summed for each method, with derivative applied to create a rate of change. "CMK" is the check_mk data and "timestamped" was from the perl script. Plotting the raw counter data of course shows very little, but with this derived data you can see the difference.<br />
<br />
This data was sampled once per minute: <br />
<br />
[[File:Timestamp_graphite_jitter.PNG|400px]]<br />
<br />
For our uses at SSEC, this was acceptable. Sampling much more frequently will of course make the error smaller.<br />
<br />
<br />
* Graphite - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* OMD - check_mk, nagios, pnp4nagios<br />
* check_mk local scripts - these are called via check_mk, at whatever rate is desired. http://www.ssec.wisc.edu/~scottn/files/lustre_stats_mds.cmk http://www.ssec.wisc.edu/~scottn/files/lustre_stats_oss.cmk<br />
* graphios https://github.com/shawn-sterling/graphios - a python script to send your nagios performance data to graphite<br />
* Grafana - http://grafana.org - not required, but convenient for dashboards.<br />
<br />
'''Grafana Lustre Dashboard Screenshots:'''<br />
<br />
[[File:Meta-oveview.PNG|200px|Metadata for multiple file systems.]] [[File:Fs-dashboard.PNG|200px|Dashboard for a lustre file system.]]<br />
<br />
==== Logstash, python, and Graphite ====<br />
<br />
Brock Palen discusses this method: http://www.failureasaservice.com/2014/10/lustre-stats-with-graphite-and-logstash.html<br />
<br />
==== Collectd plugin and Graphite ====<br />
<br />
This talk mentions a custom collectd plugin to send stats to graphite:<br />
http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
<br />
Unsure if the source for that plugin is available.<br />
<br />
==== A Note about Jobstats ====<br />
<br />
If using a Whisper or RRD-file based solution, jobstats may not be a great fit. The strength of RRD or Whisper files is that you have a set size for each metric collected. If your metrics are now per-job as opposed to only per-export or per-server, your ''number of metrics'' is now growing without bound.<br />
<br />
Solutions anyone?<br />
<br />
==== Jobstats: Finding jobs doing I/O over watermark ====<br />
<br />
The Perl script show_high_jobstats.pl can be used to collect and filter the current jobstats output from Lustre servers. It is useful for finding jobs doing I/O above a high watermark. This tool was briefly mentioned in Roland Laifer's talk at LAD'15 (see references). You can download it here: <br />
https://www.scc.kit.edu/scc/sw/lustre_tools/show_high_jobstats.tgz<br />
<br />
==== Jobstats: A lightweight solution to provide I/O statistics to users ====<br />
<br />
The second part of Roland Laifer's talk at LAD'15 (see references) described a lightweight solution to provide I/O statistics to users. When a batch job is submitted, a user can request statistics for dedicated Lustre file systems. After job completion the batch system writes files which include the job ID, file system name, user name and email address. For each file system, a cron job on one server uses these files, collects jobstats from all servers and sends an email with I/O statistics to the user. You can download the scripts (which require a few modifications) and a detailed description here: <br />
https://www.scc.kit.edu/scc/sw/lustre_tools/jobstats2email.tgz<br />
<br />
Note that array jobs are not well tested and might cause problems.<br />
For example, job IDs might get forged, or a single job array could trigger thousands of emails. It might therefore be a good idea<br />
not to send emails for array jobs at all: the batch system could simply create no input files when job arrays are used.<br />
<br />
== References and Links ==<br />
<br />
<br />
* Scott Nolin and Andrew Wagner, "Lustre Metrics - New Techniques for Monitoring", LUG2015. http://cdn.opensfs.org/wp-content/uploads/2015/04/Lustre-Metrics-New-Techniques-for-Monitoring_Nolin_Wagner.pdf<br />
* Daniel Kobras, "Lustre - Finding the Lustre Filesystem Bottleneck", LAD2012. http://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf<br />
* Florent Thery, "Centralized Lustre Monitoring on Bull Platforms", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/11_Florent_Thery_LAD2013-lustre-bull-monitoring.pdf<br />
* Daniel Rodwell and Patrick Fitzhenry, "Fine-Grained File System Monitoring with Lustre Jobstat", LUG2014. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
* Gabriele Paciucci and Andrew Uselton, "Monitoring the Lustre* file system to maintain optimal performance", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/15_Gabriele_Paciucci_LAD13_Monitoring_05.pdf<br />
* Christopher Morrone, "LMT Lustre Monitoring Tools", LUG2011. http://cdn.opensfs.org/wp-content/uploads/2012/12/400-430_Chris_Morrone_LMT_v2.pdf<br />
* Roland Laifer, "Lustre tools for ldiskfs investigation and lightweight I/O statistics", LAD2015. http://www.eofs.eu/fileadmin/lad2015/slides/13_Roland_Laifer_kit_20150922.pdf<br />
<br />
<br />
* https://github.com/jhammond/lltop<br />
* https://github.com/chaos/lmt<br />
* https://github.com/chaos/cerebro<br />
* http://graphite.readthedocs.org/en/latest/<br />
* https://mathias-kettner.de/check_mk<br />
* https://github.com/shawn-sterling/graphios<br />
<br />
[[Category: Monitoring|!]]<br />
[[Category: Statistics]]<br />
[[Category: Metrics]]</div>Sknolin
http://wiki.lustre.org/index.php?title=Check_MK/Graphite/Graphios_Setup_Guide&diff=798Check MK/Graphite/Graphios Setup Guide2015-07-03T03:33:51Z<p>Sknolin: Sknolin moved page University of Wisconsin Check MK/Graphite/Graphios Setup Guide to Check MK/Graphite/Graphios Setup Guide: Should not imply UW endorsement.</p>
<hr />
<div>== Introduction ==<br />
This guide will take the user step-by-step through the Lustre Monitoring deployment that the Space Science and Engineering Center uses for monitoring all of its Lustre file systems. The author of this guide is Andrew Wagner (andrew.wagner@ssec.wisc.edu). Where possible, I have linked to our production configuration files for software to give readers a good idea of the possible settings they can or should use for their own setups.<br />
<br />
== Hardware Requirements ==<br />
<br />
Any existing server can be used for a proof-of-concept version of this guide. The requirements for several thousand checks per minute are low - a small VM can easily handle the load.<br />
<br />
Our production server easily handles ~150k checks per minute and, from a processing/disk I/O perspective, could handle much more. Here are the specs:<br />
<br />
*Dell PowerEdge R515<br />
**2x 8-Core AMD Opteron 4386<br />
**300GB RAID1 15K SAS<br />
**200GB Enterprise SSD<br />
**64GB RAM<br />
<br />
== Software Requirements ==<br />
<br />
*Centos 6 x86_64<br />
*Centos 6 EPEL Repository<br />
*Configuration Management System (Puppet, Ansible, Salt, Chef, etc)<br />
<br />
== Notes on Scaling/Size of Metrics ==<br />
Here at SSEC, we are collecting ~200k metrics per minute. The setup that we have could be scaled to a much larger size with minimal effort. Several million metrics per minute is not out of the question. However, if you are collecting tens of millions of metrics per minute, this approach will likely not work for you.<br />
<br />
== What will the final product look like? ==<br />
<br />
Before embarking on deploying this infrastructure, take a look at some example dashboards that we generated with our Grafana instance. These are not Lustre specific but show some finished products.<br />
<br />
*MDF Switch for SSEC<br />
http://snapshot.raintank.io/dashboard/snapshot/lw0eZBCgUwHtZ2hEtF1At82c6bpl443l<br />
<br />
*Datacenter Coolers<br />
http://snapshot.raintank.io/dashboard/snapshot/WXZG341nFdmWdcoEovHBPCWYmbEeumDv<br />
<br />
*Single Host<br />
http://snapshot.raintank.io/dashboard/snapshot/3SaSSlqEGyO9IjrfGru4V0nebn5TdiaD<br />
<br />
== Building the Lustre Monitoring Deployment ==<br />
<br />
=== Setting up an OMD Monitoring Server ===<br />
<br />
The first thing that we needed for our new monitoring deployment was a monitoring server. We were already using Check_MK with Nagios on our older monitoring server, and the Open Monitoring Distribution (OMD) nicely ties all of these components together. The distribution is available at http://omdistro.org/ and installs via RPM.<br />
<br />
On a newly deployed Centos6 machine, I installed the OMD-1.20 RPM. This takes care of all of the work of installing Nagios, Check_MK, PNP4Nagios, etc.<br />
<br />
After installation, I created the new OMD monitoring site:<br />
<br />
<code>omd create ssec</code><br />
<br />
This creates a new site that runs its own stack of Apache, Nagios, Check_MK and everything else in the OMD distribution. Now we can start the site:<br />
<br />
<code>omd start ssec</code><br />
<br />
You can now navigate to http://example.fqdn.com/sitename on your server, e.g. http://example.ssec.wisc.edu/ssec, and log in with the default OMD credentials.<br />
<br />
We chose to setup LDAPS authentication versus our Active Directory server to manage authentication. There is a good discussion of how to do this here:<br />
https://mathias-kettner.de/checkmk_multisite_ldap_integration.html<br />
<br />
Additionally, we setup HTTPS for our web access to OMD:<br />
http://lists.mathias-kettner.de/pipermail/checkmk-en/2014-May/012225.html<br />
<br />
At this point, you can start configuring your monitoring server to monitor hosts! Check_MK has a lot of configuration options, but it's a lot better than managing Nagios configurations by hand. Fortunately, Check_MK is widely used and well documented. The Check_MK documentation root is available at http://mathias-kettner.de/checkmk.html. <br />
<br />
=== Deploying Agents to Lustre Hosts ===<br />
<br />
To operate, the Check_MK agent on hosts runs as an xinetd service with a config file at /etc/xinetd.d/check_mk. That file includes the IP addresses allowed to access the agent in the '''only_from''' parameter. The OMD distribution comes with Check_MK agent RPMs. I rebuilt the RPM using rpmrebuild to include our updated IP addresses for our monitoring servers.<br />
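<br />
For reference, the stock agent's xinetd configuration looks roughly like the excerpt below; the IP addresses in '''only_from''' are placeholders, not our monitoring servers' real addresses.<br />
<pre><br />
# /etc/xinetd.d/check_mk (excerpt - IP addresses are placeholders)<br />
service check_mk<br />
{<br />
        type           = UNLISTED<br />
        port           = 6556<br />
        socket_type    = stream<br />
        protocol       = tcp<br />
        wait           = no<br />
        user           = root<br />
        server         = /usr/bin/check_mk_agent<br />
        only_from      = 10.0.0.10 10.0.0.11<br />
        disable        = no<br />
}<br />
</pre><br />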
<br />
After rebuilding the RPM, push it out to all hosts that will be monitored. We use a custom repository and Puppet for managing our existing software, so adding the RPM to the repo and pushing it out via Puppet can be done with a simple module.<br />
<br />
After deployment, we can verify the agents work by adding them to Check_MK via the GUI or configuration file and inventorying them. This will allow us to monitor a wide array of default metrics such as CPU Load, CPU Utilization, Memory use, and many others.<br />
<br />
=== Writing Local Checks to Run via Check_MK Agent ===<br />
<br />
Now that the Check_MK agents are deployed to the Lustre servers, we can add Check_MK local agent checks to measure whatever we want. The documentation for local checks is here: http://mathias-kettner.de/checkmk_localchecks.html.<br />
<br />
The output of the check has to have a Nagios status number, Name, Performance Data, and Check Output.<br />
<br />
Check out the examples in the Check_MK documentation for the output formatting. You can use whatever language your server supports to write the local check. At SSEC, Scott Nolin has implemented several Perl scripts that poll Lustre statistics and output them in the Check_MK format. You can read more about the checks here:<br />
http://wiki.opensfs.org/Lustre_Statistics_Guide.<br />
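<br />
To make the local check format concrete, here is a minimal Python sketch; the SSEC production checks linked above are Perl, and the service naming and parsing details below are assumptions for illustration. Each output line consists of a status number, an item name, performance data, and the check output text.<br />
<pre><br />
#!/usr/bin/env python<br />
# Minimal sketch of a check_mk local check reporting cumulative OST write bytes.<br />
# Place an executable script like this in the agent's local checks directory.<br />
# Service names below are illustrative, not the SSEC naming scheme.<br />
import subprocess<br />
<br />
def ost_write_totals():<br />
    """Yield (target, write_bytes_sum) parsed from 'lctl get_param obdfilter.*.stats'."""<br />
    out = subprocess.check_output(<br />
        ["lctl", "get_param", "obdfilter.*.stats"]).decode("ascii", "replace")<br />
    target = None<br />
    for line in out.splitlines():<br />
        if line.startswith("obdfilter.") and line.rstrip().endswith(".stats="):<br />
            target = line.split(".")[1]              # e.g. scratch-OST0001<br />
        elif target and line.strip().startswith("write_bytes"):<br />
            fields = line.split()<br />
            yield target, int(fields[6])             # sum of bytes written<br />
<br />
for target, total in ost_write_totals():<br />
    # Local check line: status itemname perfdata output-text<br />
    print("0 Lustre_write_%s write_bytes=%d OK - cumulative bytes written" % (target, total))<br />
</pre><br />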
<br />
=== Check_MK RRD Graphs ===<br />
<br />
Once you start collecting this performance data, OMD automatically uses PNP4NAGIOS to create RRD graphs for each collected metric. Check_MK will then display these RRDs in the monitoring interface. This can be useful for small scale testing where you are only collecting a few tens of metrics. However, a thorough stat collection on large Lustre file systems can yield hundreds or even thousands of individual metrics. Check_MK and PNP4NAGIOS are thoroughly outclassed when asked to display such a large number of RRD graphs, and they respond poorly to high I/O situations.<br />
<br />
Thus, we turn to the Graphite/Carbon metric storage system.<br />
<br />
=== Deploying Graphite/Carbon ===<br />
<br />
The Graphite/Carbon software package collects metrics and stores them in Whisper database files. Graphite is the web frontend and Carbon is the backend that controls the Whisper database files. Whisper files are similar to RRD files in that they have a defined size and fixed constraints on how the file manages time series data as time passes. However, they have many key improvements as described here: http://graphite.readthedocs.org/en/latest/whisper.html<br />
<br />
The installation and basic setup of Graphite and Carbon is pretty easy. We used the version of Graphite found in EPEL.<br />
<br />
<code> yum install graphite-web </code><br />
<br />
This installs both Graphite and Carbon. Graphite is a basic web frontend for visualizing data. The web configuration can be found at /etc/httpd/conf.d/graphite-web.conf. While the Graphite frontend works alright, at SSEC we vastly prefer the usability of Grafana. A later section describes how that frontend is deployed and configured. <br />
<br />
There are three Carbon services that need to be set to run on startup:<br />
<br />
*carbon-aggregator<br />
*carbon-cache<br />
*carbon-relay<br />
<br />
The Carbon configuration files can be found at /etc/carbon. Below, I've linked to our settings for the various Carbon configuration files. I don't attest to the correctness of these settings, but if you have no idea where to start, these will at least get you up and running!<br />
<br />
*http://www.ssec.wisc.edu/~andreww/files/carbon.conf<br />
*http://www.ssec.wisc.edu/~andreww/files/storage-aggregation.conf<br />
*http://www.ssec.wisc.edu/~andreww/files/storage-schemas.conf<br />
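<br />
As a quick illustration of what these files contain, a minimal storage-schemas.conf stanza for Lustre metrics might look like the following. The pattern and retention periods here are illustrative assumptions, not the SSEC production values linked above.<br />
<pre><br />
# storage-schemas.conf excerpt (illustrative values only)<br />
[lustre]<br />
pattern = ^lustre\.<br />
retentions = 60s:30d,10m:1y<br />
</pre><br />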
<br />
Once Carbon is running, you can actually use the Graphite/Carbon installation if you don't want to have dashboards and such. Graphite is well documented and you can read more about the software here: http://graphite.readthedocs.org/en/latest/<br />
<br />
==== Graphite Metric Namespace ====<br />
<br />
Creating an appropriate namespace for Graphite metrics is difficult. We went through a dozen iterations at SSEC before arriving at one that is now largely satisfactory. The Graphite namespace refers to how you organize your metrics in the Graphite/Carbon system and how you will access them later.<br />
<br />
Below is an example namespace for an SSEC Lustre OSS in our Delta filesystem:<br />
<br />
<code>lustre.oss.delta.delta-1-21.delta-OST0010.stats.write_bytes</code><br />
<br />
The above example is the namespace for the bytes written to OST0010 on the delta-1-21 server, filed under the lustre.oss category. You can almost think of these as paths to the Whisper files in which Graphite stores metrics. Each field between the periods is mutable. For example, I could change write_bytes to read_bytes to get that metric for OST0010 on delta-1-21, or change delta-1-21 to delta-3-11 to look at a different server entirely. Each of those fields can be named anything logical and then accessed with Graphite or Grafana to create graphs of what you are interested in visualizing. <br />
<br />
Here's another example:<br />
<br />
<code>servers.iris.compute.r720-0-5.mem.buffers</code><br />
<br />
The above namespace points to the memory buffer metrics for a server named r720-0-5, of the compute type, in the SSEC HPC cluster Iris. Look at the chart below to get the best idea of how the namespace works within Graphite in practice. One of the great things about Graphite is that you can use wildcards to match all metrics in a given namespace field. See how you might use that in practice below:<br />
<br />
{| class="wikitable"<br />
|-<br />
!servers!!iris!!compute!!r720-0-5!!mem!!buffers<br />
|-<br />
|servers||iris||compute||r720-0-5||cpu_usage||user<br />
|-<br />
|servers||iris||compute||*||cpu_usage||user<br />
|-<br />
|servers||iris||compute||*||cpu_usage||*<br />
|}<br />
In the above namespace examples, I used wildcards to select all metrics in the given field. These selection methods can be combined with Graphite's built-in functions to create powerful graphs. For example, the last entry in the chart would display all of the cpu_usage metrics for the compute nodes in the iris group of servers. <br />
<br />
{| class="wikitable"<br />
|-<br />
!lustre!!oss!!delta!!delta-1-21!!delta-OST0010!!stats!!write_bytes<br />
|-<br />
|lustre||oss||delta||delta-1-21||delta-OST0010||exportstats||172.1.1.25@o2ib||write_bytes<br />
|-<br />
|lustre||oss||delta||*||*||exportstats||172.1.1.25@o2ib||write_bytes<br />
|-<br />
|lustre||oss||delta||*||*||stats||write_bytes<br />
|}<br />
<br />
As in the earlier example, these Graphite namespaces show the power of wildcards. In this case, the last entry in the chart would select the write_bytes counters for all of the OSTs in the delta filesystem. Graphite's built-in functions can then sum the metrics to create a graph of the write rate of the entire filesystem, as shown below.<br />
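<br />
For example, a hypothetical render target along these lines (the function names are standard Graphite functions; the metric path builds on the namespace above) turns the per-OST counters into a single file-system-wide write rate:<br />
<pre><br />
nonNegativeDerivative(sumSeries(lustre.oss.delta.*.*.stats.write_bytes))<br />
</pre><br />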
<br />
=== Deploying Grafana ===<br />
<br />
Grafana is the dashboard that SSEC prefers to use for data visualization. <br />
<br />
* To read about Grafana, check out this link: http://docs.grafana.org/<br />
* To try a test install of Grafana to get a feel for use, go here: http://play.grafana.org<br />
* To install Grafana, we used the RPM available here: http://docs.grafana.org/installation/rpm/<br />
<br />
Building dashboards via the Grafana GUI is easy, and Grafana quickly becomes the 'analyst tool of choice' for understanding the data. These dashboards will serve many needs.<br />
<br />
When these types of dashboards are not enough, or the workflow makes them too tedious to build, you can create *scripted dashboards* in Grafana. These are JavaScript programs and require some coding, so they are more work to create - but they are potentially very powerful.<br />
<br />
=== Using Graphios to Redirect Lustre Stats to Carbon ===<br />
<br />
Given our decision to stick with a single monitoring infrastructure, we needed a way to direct the performance data coming in via Nagios into Graphite. Graphios takes a Nagios performance data file and parses it for data marked with special prefixes/postfixes. Graphios then pipes this data to the Carbon ingest port for storage and access via Graphite.<br />
<br />
The Graphios Github page has lots of installation and configuration examples, some of which I've contributed. Detailed setup instructions for OMD are contained within the readme at https://github.com/shawn-sterling/graphios.<br />
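<br />
As a sketch of what those prefixes look like in practice: graphios reads custom variables such as _graphiteprefix from the Nagios object definitions to decide where a service's performance data lands in the Graphite namespace. The host name, service name, and prefix below are hypothetical; see the graphios readme for the exact variables and the OMD-specific setup.<br />
<pre><br />
# Hypothetical Nagios service definition fragment used by graphios<br />
define service {<br />
    use                   generic-service<br />
    host_name             delta-1-21<br />
    service_description   Lustre_OST_stats<br />
    _graphiteprefix       lustre.oss.delta<br />
}<br />
</pre><br />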
<br />
[[Category:Monitoring]] [[Category:Metrics]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency&diff=601Lustre Resiliency: Understanding Lustre Message Loss and Tuning for Resiliency2015-05-12T01:38:12Z<p>Sknolin: Added Router Category</p>
<hr />
<div>Note to reader,<br />
<br />
The content of this page is a paper I presented at the 2015 Cray User Group conference in Chicago, IL. I'll be working over time to wiki-fy this content, and make it less Cray-specific.<br />
<br />
Thanks,<br />
Chris Horn<br />
<br />
== Introduction ==<br />
The exchange of requests and replies between hosts forms the basis of the Lustre protocol. One host will send a message containing a request to another and await a reply from that other host. The underlying network protocols used to exchange messages are abstracted away via the Lustre Network (LNet) software layer. The LNet layer is network-type agnostic. It utilizes a Lustre Network Driver (LND) layer to interface with the driver for a specific network type.\cite{UnderstandingLustreInternals}<br />
<br />
Historically, Lustre and LNet have had a poor track record dealing gracefully with the loss of a request message or the associated reply. This difficulty is largely due to their reliance on internal timers and per-message timeouts to infer message loss and the health of participating hosts. In addition, there has long been a fundamental single point of failure in the Lustre protocol whereby Lustre servers would not re-send certain requests to clients which could result in application failure.<br />
<br />
Cray has worked with Seagate and the Lustre open source community to address the flaw in the Lustre protocol, fix new issues that were discovered as part of that effort, and to tune Lustre, LNet, and LND timers and timeouts to maximize the resiliency of Lustre in the face of message loss. Changes are still landing to the canonical tree, but we expect Lustre 2.8.0 to be fully resilient to lost traffic such that clients can survive finite network disruptions without application failure and message loss has minimal performance impact.<br />
<br />
This paper covers a number of topics as they relate to Lustre resiliency. To provide some background we first discuss the role of Lustre evictions and locking in a Lustre file system, the effects of message loss, and Lustre's response to some common component failures. Finally, we discuss our recommendations for tuning Lustre to realize improved resiliency on Cray systems.<br />
<br />
== Understanding Lustre Client Evictions ==<br />
An eviction in Lustre is an action taken by a server when the server determines that the client can no longer participate in file system operations. Servers take this action to ensure the file system remains usable when a client fails or otherwise misbehaves. When a client is evicted all of its locks are invalidated. As a result, any of the client's cached inodes are invalidated and any dirty pages must be dropped.<br />
<br />
Servers can evict clients if clients do not respond to certain server requests in a timely manner, or if clients do not communicate with the server at regular intervals. Evictions of the latter sort are carried out by a server feature called the ping evictor.<br />
<br />
The ping evictor exists to prevent the problem of cascading timeouts \cite{CasTimeouts}. Since message timeouts are used to determine a connection's state a problem occurs where dependencies between requests result in implicit dependencies between connections. If a client holds a resource and dies the server would not realize this until a conflicting request for the resource is made. The server would then need to wait for a message timeout to detect the failed client. This can cascade as the number of resources held by a client increases or the number of dead clients increases. The ping evictor helps with this problem by proactively detecting failed clients and reclaiming their resources via eviction.<br />
<br />
Idle clients are expected to communicate with Lustre servers at regular intervals by sending servers a Portal RPC (PtlRPC) ping. The interval is equal to <code>obd_timeout / 4</code>, where <code>obd_timeout</code> is a configurable Lustre parameter and defaults to 100 seconds, for a 25 second ping interval (note that Cray has historically used a 300 second <code>obd_timeout</code>). The ping evictor on a server keeps track of these client pings. If a particular client has failed to deliver a ping within <code>1.5*obd_timeout</code> seconds (servers are very conservative about evicting clients; up to two pings may be missed or lost), and the server has not seen any other RPC traffic from that client, then the ping evictor will evict the client.<br />
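<br />
To make the arithmetic concrete, here is a quick sketch of the timers for the default and the historical Cray <code>obd_timeout</code> values mentioned above:<br />
<pre><br />
# Ping-evictor timing for the two obd_timeout values mentioned above.<br />
for obd_timeout in (100, 300):           # seconds: Lustre default, historical Cray value<br />
    ping_interval = obd_timeout / 4      # idle clients ping this often<br />
    evict_after = 1.5 * obd_timeout      # this long with no ping or other RPC triggers eviction<br />
    print("obd_timeout=%ds: ping every %ds, evict after %ds of silence"<br />
          % (obd_timeout, ping_interval, evict_after))<br />
# obd_timeout=100s: ping every 25s, evict after 150s of silence<br />
# obd_timeout=300s: ping every 75s, evict after 450s of silence<br />
</pre><br />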
<br />
Clients can exhibit misbehavior for a number of reasons including: client-side bugs, a kernel panic or oops, heavy load, and serious Lustre errors known as LBUGs. In addition, component failures external to the client may result in client evictions. We'll discuss this topic in more depth later in this paper.<br />
<br />
A client will learn that it has been evicted the next time it connects to the server. When the client learns of its eviction it must drop all locks since they've been invalidated by the server, and it must drop all dirty pages since it no longer has the requisite locks.<br />
<br />
Clients may be unaware of their eviction if they do not have any outstanding user request and previous requests were buffered. This behavior is POSIX semantics, so applications need to check return codes or use fsync(2). Unaware users may call this silent data corruption.<br />
<br />
== Understanding Lustre Locking ==<br />
In Lustre, shared resources are protected by locks, and the most common resource is a file. When an application wishes to read or modify a file in some way it contacts the Metadata Server (MDS) for the <code>open()</code> system call. The MDS provides striping information for the file. The client will then enqueue a lock for each stripe of the file to the respective Object Storage Targets (OSTs), which are hosted by Object Storage Servers (OSSs). The OSS checks whether the lock request conflicts with any other lock that has already been granted, and sends a blocking callback request to any client holding a conflicting lock. The response to this blocking callback must, eventually, be a lock cancel.<br />
<br />
Once all conflicting locks have been cancelled a completion callback request is sent to the enqueuing client which grants the lock. The client must acknowledge receipt of the completion callback before it may start using the lock.<br />
<br />
There are three lock related requests that are only ever sent from servers to clients. Two are the blocking and completion callback requests mentioned earlier, and the other is the glimpse callback request. These RPCs are sometimes referred to as Asynchronous System Trap (AST) RPCs or, more commonly, as just ASTs.<br />
<br />
When a server sends a blocking AST to a client, either as a standalone message or embedded within a completion AST, the server sets a timer after which the client shall be evicted if it has not fulfilled the server's request. This is referred to as the lock callback timer.<br />
<br />
The blocking AST informs the client that a previously granted lock must be cancelled. The client must acknowledge receipt of the blocking AST, finish any I/O occurring under the lock, and then send a lock cancel request to the server. If the lock has not been cancelled by the time the lock callback timer has expired then the client shall be evicted. The server will refresh the lock callback timer as the client performs bulk reads and writes under the lock, so the client is afforded sufficient time to complete its I/O.<br />
<br />
To illustrate why evictions are necessary consider two clients that want access to the same file. The first client, the writer, creates a file, writes some data to the file, and then promptly suffers a kernel panic. The second client, the reader, wants to read the data written by the first client. The reader requests a protected read lock for each stripe of the file from the corresponding OST. Upon receipt of the lock enqueue request, the OSS notes that it has already granted a conflicting lock to the writer. As a result the server sends a blocking AST to the writer, and it arms a lock callback timer for the lock. Since the writer has crashed it will not be able to fulfill the server's request. At this point, the server cannot grant the lock to the reader, and so the reader is unable to complete its \verb&read()&. When the lock callback timer expires the server will evict the writer which will invalidate all of the writer's locks, and, since the writer no longer holds a valid lock on the file, the server can grant the lock to the reader.<br />
<br />
== Dropped RPCs and Lustre ==<br />
When a Lustre RPC is lost users may observe performance degradation or, in the worst case, client eviction and application failure. In this section we describe some of the different issues that Lustre must deal with in order to recover successfully from an RPC loss.<br />
<br />
=== Detecting Message Loss With RPC Timeouts ===<br />
When a client sends an RPC to a server, or a server sends an RPC to a client, a timeout value is assigned to the RPC. This timeout value is derived from Lustre's Adaptive Timeouts feature, and it accounts for the time it takes to deliver the RPC (network latency) and the amount of time the message recipient needs to process the request and send a reply (service time). A host infers that a message has been lost when this timeout expires and the host has not received a response from the message recipient. This is in contrast to a failure to send an RPC in the first place, which can typically be detected much sooner because LNDs notify upper layers when they are unable to send an RPC.<br />
<br />
When an RPC timeout occurs the connection between the sender and recipient is severed and the client must reconnect to the server by sending a connect RPC. When a connection is reestablished with the server the client must resend the lost RPC. The client must do all of this without repeating the cycle of lost RPCs, disconnections, reconnections, and resends. In a routed environment this means we need to avoid using any bad routes which may have been the culprit in the initial RPC loss, or in the event of an HSN quiesce we would ideally only try to send messages once the quiesce is over. In the case of a lost lock cancel, or a lost reply to a blocking AST, the client must do all of this as quickly as possible so as to avoid the lock callback timer expiring on the server and the subsequent eviction.<br />
<br />
More information on Adaptive Timeouts is available in \ref{AT}.<br />
<br />
=== Avoiding Bad Routes ===<br />
When messages are dropped due to route or router failure any LNet peer using that route or router, i.e. a Lustre client or server, may be impacted, and since there are typically many more clients than routers we can expect a relatively large disruption from a single router failure. Using bad routes wastes time and resources because there are a limited number of messages that a peer can have in flight and any message sent over a bad route will need to be resent. Thus, it is very desirable to detect bad routes as quickly as possible and remove them from an LNet peer's routing table. Traditionally route and router health is determined by the LNet Router Pinger and Asymmetric Route Failure Detection features. These features work in conjunction to determine the health of both routers themselves and the routes hosted by routers.<br />
<br />
The router pinger on an LNet peer works by periodically sending an LNet ping to each known router. If a peer receives a response from the router within a timeout period then the router is considered alive.<br />
<br />
The Asymmetric Route Failure Detection feature works by packing the router's ping reply with additional information about the status, up or down, of the router's network interfaces. When an LNet peer receives a ping reply it inspects the network interface status information to determine for which remote networks the router should be used. If a remote interface on the router is not healthy then that router will not be used when sending messages to peers on the associated remote network.<br />
<br />
See section \ref{useCases} for a description of how these features are used to respond to LNet router and route failure. For more information on tuning the LNet router pinger and asymmetric route failure detection features see \ref{tuningLustreForResiliency}.<br />
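<br />
A hedged example of inspecting route state on a routed client; the proc path and \verb&lctl& subcommand below are the ones we believe apply to the Lustre versions discussed here, and output formats vary by release:<br />
<br />
<pre><br />
# Each route is listed with its remote network, gateway nid, and up/down state.<br />
cat /proc/sys/lnet/routes<br />
lctl route_list          # prints the same routing table via lctl<br />
</pre><br />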
<br />
=== Reconnecting with the Connect RPC ===<br />
The first thing a client must do after experiencing an RPC timeout is reestablish a connection with the target of the lost RPC. This is accomplished via the connect RPC. Lustre targets can be reached via multiple LNet nids for the purposes of failover or a multi-homed server. Connect RPCs are first sent using the LNet nid of the last known good connection before trying any alternative nids in a round-robin manner. Connect RPCs are sent on an interval, so if one connection attempt is unsuccessful there may be a delay until a client attempts the next one.<br />
<br />
This is a significant complicating factor in a client's ability to reestablish connection with the server in a timely manner. Consider the case of message loss due to a failed route. During the time it takes to detect the failed remote interface the client considers the bad route to be valid and will continue to use it for sending RPCs to the servers. If it happens to send the connect RPC using the bad route then this RPC will eventually time out. As a result the client will send subsequent connect RPCs to alternate nids where the Lustre targets may not be available. These connection attempts will be rejected, typically with \verb&-ENODEV&, as those resources are not available at that host. This delays the process of reestablishing the connection even further.<br />
<br />
We can see that the timing of our reconnect attempts is important. The faster we reconnect, the more time we'll have to resend important RPCs to mitigate performance impact or avoid eviction; however, the faster we reconnect, the more likely we are to hit a bad route or to attempt a send while the network is otherwise unable to reliably deliver messages.<br />
<br />
=== Resending Lost RPCs ===<br />
The final action a client must take is to resend any RPCs it thinks were lost. As with the connect RPC, the client must take care to avoid bad routes when resending lost RPCs to avoid repeating the lost RPC cycle. Lustre clients have long had the ability to resend bulk RPCs, but not the ASTs described in section \ref{understandingLustreLocking}.<br />
<br />
Lustre's inability to resend ASTs was a single point of failure in the Lustre protocol that could result in client eviction whenever an AST was not delivered, or the reply to an AST was lost. We'll discuss the solution to this problem in \ref{resiliencyEnhancements}.<br />
<br />
== Failure Scenarios ==<br />
There are a number of scenarios under which Lustre messages may be lost. In this section we discuss some of the more common scenarios and Lustre's response.<br />
<br />
=== LNet Routing ===<br />
In this section we consider two resiliency issues specific to LNet routing: LNet router node death, and the death of a remote interface on an LNet router.<br />
<br />
==== Router death ====<br />
If a router node dies, e.g. because it suffered a kernel panic, then any messages in transit to that router and any messages buffered on that router at the time it died will need to be resent. The LNet layer does not track which routers are used to send particular messages, and the Portal RPC (PtlRPC) layer does not have access to this information either. As a result, PtlRPC and LNet do not know to try a different router when a previous send failed.\footnote{All routes to a remote network are used in a round-robin manner.} We thus rely on the router pinger to determine router aliveness. If a router ping was in flight at the time of the panic then we should be able to mark the router down after the ping timeout has expired. Otherwise we may need to wait for the next ping interval, in addition to the time needed for the ping to time out, in order to mark the router as down. During the time between the kernel panic and the marking of the router as down the router can still be used as a next-hop.<br />
<br />
==== Remote Interface Death ====<br />
As discussed in \ref{avoidingBadRoutes}, LNet peers rely on the Asymmetric Route Failure Detection feature to determine the health of remote interfaces on an LNet router node. For example, say a router has one Infiniband (IB) interface configured on LNet o2ib0, another infiniband interface configured on LNet o2ib1, and one Aries interface on LNet gni0. The LNet nids for each of these interfaces are 10.100.0.0@o2ib0, 10.100.1.1@o2ib1, and 27@gni respectively.<br />
<br />
A client on the gni LNet has two routes that utilize 27@gni for the remote networks o2ib0 and o2ib1; i.e., if the client wants to send an RPC to a server on either o2ib0 or o2ib1 it can use 27@gni as a next-hop for those messages.<br />
<br />
Now suppose the router's IB interface at 10.100.1.1@o2ib1 fails. After some period of time, detailed below, the router marks that interface as down. The client's router pinger sends a ping to the router at the next ping interval. Since the router node is up, and the gni interface is healthy, the router will respond to the client with the information that 10.100.0.0@o2ib0 is up and 10.100.1.1@o2ib1 is down. The client sees that this router's interface for the o2ib1 network is down, so it will no longer use 27@gni as a next-hop when it needs to send messages to servers on the o2ib1 LNet. However, it will continue to use 27@gni as a next-hop when it needs to send messages to servers on the o2ib0 LNet.<br />
<br />
When the router's interface on o2ib1 eventually recovers the router will mark that interface as alive as soon as it sees traffic come over that interface.\footnote{Peers on the o2ib1 LNet will be sending LNet pings to the "down" interface at an interval specified by the dead_router_check_interval LNet module parameter.} The next router ping from the client will retrieve the updated information, and the client will then begin using 27@gni as a next-hop for sending messages to peers on the o2ib1 LNet.<br />
<br />
It should be noted that it takes a significant amount of time to propagate router health information to peers, and it takes additional time to propagate a change in remote interface health to peers. Routers infer local interface health by monitoring traffic over the interfaces. The router is aware that peers will be sending router pings, and the longest interval at which these pings will occur. Thus, if an interface does not receive any traffic in this interval, plus the timeout value for router pings, then the router will assume the underlying interface is not healthy and will change its status to down.<br />
<br />
Using Lustre's default values, it takes a router 110 seconds, based on a 60 second ping interval plus a 50 second ping timeout, to mark an interface down. As mentioned previously, peers only learn about the interface status change after pinging the router. With a 60 second ping interval and 50 second ping timeout the worst case to detect a failed remote interface is on the order of 220 seconds. In practice the worst case isn't quite as bad if the local network is healthy since the peer should get a ping reply quickly. Thus, it is generally closer to the time for the router to detect the failed interface plus the ping interval or approximately 170 seconds.<br />
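<br />
The detection-time arithmetic above, as a small hedged sketch (values are the defaults discussed in the text; adjust to your configuration):<br />
<br />
<pre><br />
ping_interval=60   # live/dead_router_check_interval on the peers<br />
ping_timeout=50    # router_ping_timeout<br />
<br />
# Time for the router itself to mark a local interface down:<br />
router_detect=$(( ping_interval + ping_timeout ))                # 110 s<br />
# Worst case for a peer to learn of the failure:<br />
peer_worst=$(( router_detect + ping_interval + ping_timeout ))   # 220 s<br />
# Typical case when the peer's own network is healthy:<br />
peer_typical=$(( router_detect + ping_interval ))                # 170 s<br />
echo "router: ${router_detect}s, peers: ${peer_typical}s-${peer_worst}s"<br />
</pre><br />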
<br />
=== Client Death ===<br />
When a client dies it will eventually be evicted so that any resources held by that client might be reclaimed. This will be accomplished either via a lock callback timer expiration or via the ping evictor. If the client holds a conflicting lock then it may be evicted by a server issuing a blocking callback for that lock. Otherwise it will eventually be evicted by all Lustre servers when it fails to ping them after one and a half times the \verb&obd_timeout&.<br />
<br />
=== Server Death ===<br />
When a server dies in a high availability (HA) configuration its resources (Lustre targets) should failover to its HA partner. The Lustre recovery feature should generally ensure the filesystem returns to a usable state. For more information on Lustre recovery see \cite{LustreOpsManual}.<br />
<br />
=== Link Resiliency ===<br />
Cray systems are engineered to withstand the loss of certain components without requiring a system reboot. The same technology is used to allow manual removal and replacement of compute blades without a system reboot (warm swap). This technology is collectively known as link resiliency.\cite{S0041C}<br />
<br />
Traffic on a Gemini or Aries HSN is routed over what are termed links. A link is a software term for a connection between two Gemini or Aries Link Control Blocks (LCBs). A software daemon runs on each blade controller that is able to detect failed links as well as power loss to a Gemini or Aries Network Card. When a link failure is detected this daemon sends an event which is received by a software daemon, \verb&xtnlrd&, running on the System Management Workstation (SMW). The \verb&xtnlrd& daemon is responsible for coordinating the steps needed to recover from the failure. The \verb&xtnlrd& daemon also coordinates the steps needed to fulfill a warm swap request.<br />
<br />
At a high level the steps needed to recover from link failure include computing new routes that do not use the failed links, quiescing the network, installing the new routes, and unquiescing the network. It is necessary to quiesce the HSN in order to avoid inconsistent routing of traffic, dead ends, or cycles when installing the new routes.<br />
<br />
Lustre servers have no knowledge of an HSN quiesce. While the HSN is quiesced servers will not see any traffic from clients, and they will not be able to deliver messages to clients. If clients do not resume communication with servers in a timely manner, i.e. before the expiration of any lock callback timers or the ping eviction timer, then they will be evicted. Servers cannot distinguish between a client caught up in an extended network outage and a client that has crashed. Thus, it is crucial that clients, servers, and routers do whatever they can to restore connectivity and resume communication as quickly as possible. This includes ensuring that the lock callback timer accommodates this time span.<br />
<br />
Another side effect of the HSN quiesce is that LNet routers will be unable to respond to pings from a client's router pinger. This can result in routers being marked dead and removed from the routing tables of the clients. If all routers are marked down in this manner then clients will be unable to send messages to servers until the router pinger has been able to mark routers back up.<br />
<br />
== Resiliency Enhancements ==<br />
A number of enhancements have been made to improve Lustre resiliency in the face of message loss. The primary enhancement is the ability for servers to resend ASTs. Additional enhancements were made to improve the success rate of resending ASTs and avoid lock related evictions, as well as improvements to file system performance in the face of message loss. In this section we discuss these new features and improvements. In the following section we discuss how to tune Lustre for resiliency.<br />
<br />
=== LDLM Resend ===<br />
As discussed in \ref{resendLostRpcs}, Lustre servers historically did not resend ASTs. This meant that if an AST was lost then the target of that AST would almost certainly be evicted. LDLM resend is the primary enhancement made to fill this hole in Lustre's protocol.<br />
<br />
Before the LDLM resend enhancement was implemented, if a blocking or completion callback RPC was lost, because it was, say, buffered on a router when the router suffered a kernel panic, then the recipient of the RPC would be evicted once the lock callback timer expired even though it never received the RPC.<br />
<br />
The LDLM resend enhancement allows servers to resend these callback RPCs throughout the duration of the lock callback timer. Callback RPCs, like most other RPCs in Lustre, are assigned a timeout based on the adaptive timeouts feature. If that timeout expires, and a reply has not been received, then the server will resend the callback RPC.<br />
<br />
See \ref{ldlmTimeouts} for information on tuning the lock callback timer to allow the LDLM resend enhancement to function properly.<br />
<br />
=== Router Failure Detection ===<br />
Cray's LND for its Aries HSN, gnilnd, has previously utilized health information available on our high speed network to inform routers about the aliveness of peers. This is referred to as the peer health feature for routers. We've recently extended our use of this health information to clients. With this enhancement clients now receive an event when a router node fails. Upon receipt of this event gnilnd will notify LNet that the router is no longer alive and thus remove it from the client's routing table. This provides us another agent, in addition to the router pinger, which can remove bad routes from our routing tables and potentially do so faster than relying on the router pinger.<br />
<br />
=== Fast Reconnect ===<br />
Lustre clients on the Aries HSN have some knowledge of an HSN quiesce due to the gnilnd. The gnilnd participates in the quiesce by suspending all transmits until the quiesce is over. While the quiesce is ongoing, connections between gnilnd peers can time out. Historically, gnilnd on a client would only attempt to reestablish a connection with a peer (router) when an upper layer generated a request. We recently added a \verb&fast_reconnect& feature, which forces gnilnd on a client to quickly reconnect to routers as soon as the quiesce is lifted. When a connection is established, gnilnd will notify LNet, which will ensure the router is considered alive and can be used as a next-hop for future sends.<br />
<br />
=== Minimizing Performance Impact ===<br />
Cray has worked to minimize the performance impact of message loss and resiliency events by fixing bugs and employing defensive programming. An example of the former is a bug found in the early reply mechanism which resulted in RPC timeouts being extended much longer than they should be, and an example of the latter is an upstream enhancement we've recently adopted which places a limit on maximum bulk transfer times separate from the hard limit defined for adaptive timeouts.<br />
<br />
==== Early Replies ====<br />
Early reply is a feature that allows servers to request that clients extend their deadlines for RPCs. Any RPC queued on a server for processing is eligible for early replies. The early reply mechanism chooses requests that are about to expire, but still queued for service, and sends early replies for those requests to extend the deadlines.<br />
<br />
While testing Lustre resiliency in the face of LNet router loss we found a bug in the early reply algorithm for determining the extended deadline. This bug resulted in deadlines set well beyond the maximum allowable service time (see \verb&at_max& in \ref{AT}). In one instance we noticed timeouts on servers of 1076 seconds, and associated timeouts on clients of over 1300 seconds, when the maximum service time was configured to be 600 seconds. With the bug fixed we were able to place effective bounds on maximum service times and limit the performance impact of RPC loss due to router failure.<br />
<br />
==== Capping Bulk Transfer Times ====<br />
A bulk transfer in Lustre is performed for reads and writes. In either case a client sends a bulk RPC request to the server describing the transfer. The server then processes the request, performs any necessary work to prepare for the transfer, and then initiates the data transfer either from the client for writes or to the client for reads.<br />
<br />
Historically, bulk transfer times, i.e. the time to perform the actual data transfer once all prerequisite work had finished, were bounded by the maximum adaptive timeout. Bulk data transfer messages are not resent, so when one is lost there is no point in waiting such a long time. A change was made to configure a static timeout for bulk transfer separate from adaptive timeouts.<br />
<br />
=== Peer Health ===<br />
The LNet peer health feature is not a recent enhancement, but it can play an important role in Lustre resiliency for routed configurations. This feature can assist in efficiently failing messages that are sent to dead peers. When this feature is enabled, prior to sending traffic to a particular peer, LNet will query the interface the peer is on to determine whether the peer is alive or dead. If the peer is dead then LNet will abort the send. This helps us avoid attempts to communicate with known dead peers. Communicating with dead peers wastes resources, including network interface credits, router buffer credits, etc., that could otherwise be used to communicate with live peers. This feature can be used for messages sent to both Lustre clients and servers. This feature should only be enabled on LNet routers; otherwise it can interfere with the LNet router pinger feature by dropping the router pings being sent from clients and servers to LNet routers.<br />
<br />
== Tuning Lustre for Resiliency ==<br />
There are a number of tunable parameters that can affect the performance of Lustre on Cray hardware during and after resiliency events. The goal is to survive transient network failure without suffering any client evictions and with minimal impact on application performance. As the discussion in \ref{droppedRpcsAndLustre} outlined, there are a number of areas to consider in the face of RPC loss. Our tuning recommendations strive to strike a balance between the competing priorities of avoiding client evictions where possible while maintaining the ability to detect misbehaving clients in a reasonable time frame.<br />
<br />
=== Adaptive Timeouts ===<br />
In a Lustre file system servers keep track of the time it takes for RPCs to be completed. This information is reported back to clients who utilize the information to estimate the time needed for future requests and set appropriate RPC timeouts. Minimum and maximum service times can be configured via the \verb&at_min& and \verb&at_max& kernel module parameters, respectively.<br />
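<br />
A hedged way to inspect the current bounds and per-target estimates on a node (parameter names assume the Lustre versions discussed here):<br />
<br />
<pre><br />
lctl get_param at_min at_max        # current lower/upper service-time bounds<br />
lctl get_param osc.*.timeouts       # per-OST network and service-time estimates<br />
</pre><br />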
<br />
==== at_min ====<br />
This is the minimum processing time that a server will report back to a client. Note that it is not actually the minimum amount of time a server will take to process a request. For Lustre clients on an Aries or Gemini HSN Cray's recommendation is to set this to 40 seconds. When an RPC is lost we want it to time out quickly so that we can resend it, minimize the performance impact, and avoid client eviction. However, we also want to avoid unnecessary timeouts due to transient network quiesces. The 40 second value factors into our calculation for an appropriate LDLM timeout as discussed in section \ref{ldlmTimeouts}.<br />
<br />
Our recommendation for Lustre servers is also 40 seconds.<br />
<br />
==== at_max ====<br />
This is the maximum amount of time that a server can take to process a request. If a server has reached this value then the RPC times out. For Lustre clients on an Aries or Gemini HSN Cray's recommendation is to set this to 400 seconds. Our goal with this value is to provide servers ample time to process requests when they are under heavy load, but also limit the potential worst case I/O delay for requests which will not be processed.<br />
<br />
The worst case I/O delay on a client resulting from message loss, assuming the underlying network recovers and is healthy, is equal to the largest potential RPC timeout that a client can set. Since clients must account for network latency to and from a server, in addition to server processing time, the largest potential RPC timeout is larger than \verb&at_max&. In fact, Lustre uses the same adaptive timeout mechanism to track and estimate network latency with the same lower and upper bounds as service time estimates. Thus, the largest potential RPC timeout that a client can set is \verb&2*at_max&. By lowering \verb&at_max& from 600 to 400 seconds we reduce the worst case I/O delay from 1200 seconds, or 20 minutes, to 800 seconds or just over 13 minutes.<br />
<br />
Our recommendation for Lustre servers is also 400 seconds.<br />
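<br />
The worst-case arithmetic, as a small hedged sketch (it assumes, as described above, that both the network-latency and service-time estimates are capped at \verb&at_max&):<br />
<br />
<pre><br />
for at_max in 600 400; do<br />
    echo "at_max=${at_max}s -> worst-case RPC timeout $(( 2 * at_max ))s"<br />
done<br />
# at_max=600s -> worst-case RPC timeout 1200s<br />
# at_max=400s -> worst-case RPC timeout 800s<br />
</pre><br />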
<br />
=== LDLM Timeouts ===<br />
The timeouts for LDLM RPCs use the same adaptive timeout mechanism as other RPCs, however the lower bound for the server's lock callback timer can be configured via the \verb&ldlm_enqueue_min& parameter. Per our previous discussion we know that servers must afford the client enough time to timeout a lost RPC, reconnect to the target of the lost RPC, and resend the lost RPC. In addition, since no traffic flows during an HSN quiesce we need to account for the time spent in, and time to recover from, a quiesce as well.<br />
<br />
We also know that lock callback timers are used to prevent misbehaving clients from hoarding resources and hindering file system usability, so we need to balance between the competing goals of allowing clients and servers enough time to recover from a network outage (larger \verb&ldlm_enqueue_min&) and quickly detecting misbehaving clients (lower \verb&ldlm_enqueue_min&).<br />
<br />
Figure \ref{figLdlmEnqueueMin} shows the variables which must be considered in setting an appropriate \verb&ldlm_enqueue_min&.<br />
<br />
<pre><br />
ldlm_enqueue_min = max(2*net_latency, net_latency + quiescent_time) + 2*service_time<br />
</pre><br />
<br />
As mentioned in the previous section, both network latency and RPC service times have a lower bound of \verb&at_min&. The \verb&quiescent_time& in this formula is to account for the time it takes all Lustre clients to reestablish connections with all Lustre targets following an HSN quiesce. We've experimentally determined an average time to be approximately 140 seconds, but it is possible that this value may vary based on different factors such as the number of Lustre clients, the number of Lustre targets, the number of Lustre file systems mounted on each client, etc. Thus, given an \verb&at_min& of 40 seconds, we calculate an appropriate \verb&ldlm_enqueue_min& as:<br />
<br />
<pre><br />
ldlm_enqueue_min = max(2*40, 40 + 140) + 2*40 = 180 + 80 = 260<br />
</pre><br />
<br />
The value for Lustre servers should match that of clients.<br />
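<br />
A small hedged helper for recomputing this value when a site has measured a different quiescent time (it assumes, per the text, that \verb&at_min& bounds both the latency and service-time estimates):<br />
<br />
<pre><br />
ldlm_enqueue_min() {<br />
    local at_min=$1 quiescent=$2<br />
    # max(2*net_latency, net_latency + quiescent_time) with net_latency = at_min<br />
    local lat=$(( 2*at_min > at_min + quiescent ? 2*at_min : at_min + quiescent ))<br />
    echo $(( lat + 2*at_min ))       # add 2*service_time with service_time = at_min<br />
}<br />
ldlm_enqueue_min 40 140              # -> 260, matching the worked example above<br />
</pre><br />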
<br />
=== Router Pinger ===<br />
The goal of tuning the router pinger is to quickly detect bad routes and routers, so that they can be removed from routing tables. The interval at which LNet peers send pings to routers and the timeout value of each ping are configured via kernel module parameters. The ping interval can be configured separately for live and dead routers. The \verb&live_router_check_interval& specifies the time interval after which the router pinger will ping all live routers, and the \verb&dead_router_check_interval& specifies the same only for dead routers. The \verb&router_ping_timeout& parameter specifies how long a peer will wait for a response to a router ping before deciding the router is dead. The \verb&router_ping_timeout& should generally be no less than the LND timeout.<br />
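<br />
A hedged check of the values currently in effect; the paths assume LNet is loaded as a kernel module exposing these parameters under sysfs:<br />
<br />
<pre><br />
cat /sys/module/lnet/parameters/live_router_check_interval<br />
cat /sys/module/lnet/parameters/dead_router_check_interval<br />
cat /sys/module/lnet/parameters/router_ping_timeout<br />
</pre><br />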
<br />
==== Servers ====<br />
Since there are typically fewer servers than routers we can safely use a more frequent router ping interval and a lower ping timeout on external servers. In addition, since external servers are unable to subscribe to node failure events in the same fashion as Lustre clients on the HSN, it is important to lower the ping interval and timeout so that servers can detect failed routers more quickly. Cray recommends setting both the live and dead router check interval to 35 seconds, and the router ping timeout to 10 seconds.<br />
<br />
==== Clients ====<br />
Since there are typically many more clients than routers we recommend using the default ping interval and timeout to avoid overwhelming routers with pings. The default values for Cray's Lustre clients are 60 second ping intervals for both live and dead routers, and a 50 second ping timeout. The ping interval can be increased further for very large systems.<br />
<br />
==== Routers ====<br />
While routers usually do not ping other routers they do need to be aware of when to expect pings from other peers. Specifically, they need to be aware of the longest ping interval and timeout, so that they can detect when a network interface is malfunctioning as described in \ref{remoteInterfaceDeath}. Thus, the router's \verb&live_router_check_interval& should be equal to the maximum of the server's \verb&live_router_check_interval& and the client's \verb&live_router_check_interval&. The same holds for the router's \verb&dead_router_check_interval& and \verb&router_ping_timeout&.<br />
<br />
=== Asymmetric Route Failure Detection ===<br />
The asymmetric route failure detection feature is enabled by default starting in Lustre 2.4.0. It can be explicitly enabled or disabled via the \verb&avoid_asym_router_failure& LNet module parameter.<br />
<br />
=== Lustre Network Driver Tuning ===<br />
<br />
==== Servers and Clients ====<br />
Internal testing revealed the default \verb&ko2iblnd timeout& is unnecessarily high. Cray recommends lowering the \verb&ko2iblnd timeout& and \verb&ko2iblnd keepalive& parameters to 10 and 30 seconds respectively, so that the o2iblnd can better detect transmission problems. As discussed in \ref{peerHealth}, the peer health feature should be disabled by setting \verb&ko2iblnd peer_timeout=0& and \verb&kgnilnd peer_health=0&.<br />
<br />
==== Routers ====<br />
Internal testing revealed the default \verb&ko2iblnd timeout& is unnecessarily high. Cray recommends lowering the \verb&ko2iblnd timeout& to 10 seconds, so that the o2iblnd can better detect transmission problems. The peer health feature should be enabled for the o2iblnd by setting \verb&peer_timeout& equal to the sum of the server's \verb&ko2iblnd timeout& and \verb&ko2iblnd keepalive&, or 40 seconds, and for kgnilnd by setting \verb&kgnilnd peer_health=1&.<br />
<br />
=== Parameters by Node Type ===<br />
This section provides an overview of all our recommended parameters by type of Lustre node. This is intended as a reference for the recommendations laid out in this paper, and not a comprehensive guide to all module parameters needed for a functional Lustre filesystem. Please refer to S-0010-5203 in Craydoc for complete documentation.<br />
<br />
Figure \ref{figRouterModprobe} contains a sample modprobe configuration file for an LNet router node. Figure \ref{figServerModprobe} contains a sample modprobe configuration file for a Sonexion or CLFS Lustre server. Figure \ref{figClientModprobe} contains a sample modprobe configuration file for a Lustre client on an Aries or Gemini HSN, and figure \ref{figEslModprobe} contains a sample modprobe configuration file for a CDL client.<br />
<br />
<pre><br />
options ko2iblnd timeout=10<br />
# Server's ko2iblnd timeout +<br />
# Server's ko2iblnd keepalive<br />
options ko2iblnd peer_timeout=40<br />
options kgnilnd peer_health=1<br />
<br />
# max(server's router_ping_timeout,<br />
# client's router_ping_timeout)<br />
options lnet router_ping_timeout=50<br />
# max(server's live_router_check_interval,<br />
# client's live_router_check_interval)<br />
options lnet live_router_check_interval=60<br />
# max(server's dead_router_check_interval,<br />
# client's dead_router_check_interval)<br />
options lnet dead_router_check_interval=60<br />
</pre><br />
<br />
<pre><br />
options ko2iblnd timeout=10<br />
options ko2iblnd peer_timeout=0<br />
options ko2iblnd keepalive=30<br />
<br />
options lnet router_ping_timeout=10<br />
options lnet live_router_check_interval=35<br />
options lnet dead_router_check_interval=35<br />
options lnet avoid_asym_router_failure=1<br />
<br />
options ptlrpc at_max=400<br />
options ptlrpc at_min=40<br />
options ptlrpc ldlm_enqueue_min=260<br />
</pre><br />
<br />
<pre><br />
options kgnilnd peer_health=0<br />
<br />
options lnet router_ping_timeout=50<br />
options lnet live_router_check_interval=60<br />
options lnet dead_router_check_interval=60<br />
options lnet avoid_asym_router_failure=1<br />
<br />
options ptlrpc at_max=400<br />
options ptlrpc at_min=40<br />
options ptlrpc ldlm_enqueue_min=260<br />
</pre><br />
<br />
<pre><br />
options ko2iblnd timeout=10<br />
options ko2iblnd peer_timeout=0<br />
options ko2iblnd keepalive=30<br />
<br />
options ptlrpc at_max=400<br />
options ptlrpc at_min=40<br />
options ptlrpc ldlm_enqueue_min=260<br />
</pre><br />
<br />
== Site-Specific Tuning ==<br />
The recommendations laid out in \ref{tuningLustreForResiliency} were shown to eliminate client evictions from LNet route and router failure, as well as link resiliency events, on a 19 cabinet XC30, but it is unlikely that we can provide one set of settings for every system configuration. Thus we should consider ways in which these parameters may need to be modified.<br />
<br />
Based on our understanding of Lustre's resiliency features, the Lustre protocol, and the effects of the different tunable parameters, we can reason about how the parameter settings might be tweaked to maintain resiliency under different conditions and system configuration. Relevant factors may include routed vs. non-routed filesystems, scale, workload, and network technology.<br />
<br />
For example, if server performance profiling indicates that servers are rarely, if ever, under heavy load then \verb&at_max& might be lowered to further reduce worst case client side timeouts. Similarly, if servers are frequently under high load it may be desirable to increase \verb&at_max& to allow servers additional time to process requests and avoid unnecessary RPC timeouts. Figure \ref{figSlowOst} displays a Lustre error message which may indicate the need to increase \verb&at_max& to allow servers additional time to process requests.<br />
<br />
<pre><br />
Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).<br />
</pre><br />
<br />
It is also likely that the lock callback timer will need to be adjusted to account for system configuration. Certain messages that appear in the console log can serve as a starting point for determining the quiescent time used to compute an appropriate \verb&ldlm_enqueue_min&. Figure \ref{figQuiesce} shows messages printed to the console log as part of a link resiliency event. The messages originate from a single client, and they indicate the beginning of the resiliency event as well as when the client was able to reconnect to the Lustre targets. The timestamps indicate that it took this client 125 seconds to recover from the link resiliency event.<br />
<br />
<pre><br />
21:26:51.388273-05:00 c1-0c2s5n0 LNet: Quiesce start: hardware quiesce<br />
21:27:06.393195-05:00 c1-0c2s5n0 LNet: Quiesce complete: hardware quiesce<br />
21:27:13.429388-05:00 c1-0c2s5n0 LNet: Quiesce start: hardware quiesce<br />
21:27:23.435159-05:00 c1-0c2s5n0 LNet: Quiesce complete: hardware quiesce<br />
21:28:24.938501-05:00 c1-0c2s5n0 Lustre: snx11023-OST0009-osc-ffff880833997000: Connection restored to snx11023-OST0009 (at 10.149.4.7@o2ib)<br />
21:28:49.952123-05:00 c1-0c2s5n0 Lustre: snx11023-OST0002-osc-ffff880833997000: Connection restored to snx11023-OST0002 (at 10.149.4.5@o2ib)<br />
21:29:05.252357-05:00 c1-0c2s5n0 Lustre: snx11023-OST000c-osc-ffff880833997000: Connection restored to snx11023-OST000c (at 10.149.4.8@o2ib)<br />
</pre><br />
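<br />
A hedged helper for turning two console timestamps into an elapsed time when estimating the recovery window (GNU date assumed; the endpoints below are the first and last lines of the excerpt above and so give an upper bound on the window):<br />
<br />
<pre><br />
elapsed() { echo $(( $(date -d "$2" +%s) - $(date -d "$1" +%s) )); }<br />
elapsed "21:26:51" "21:29:05"    # -> 134 s, first quiesce start to last restore<br />
</pre><br />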
<br />
== Future Work ==<br />
Cray is continually working to improve Lustre resiliency. This section provides a brief look at some of our current work and future plans.<br />
<br />
In order to enhance our ability to test Lustre resiliency, we've developed a Network Request Scheduler (NRS) policy for fault injection. The NRS policy, termed NRS Delay, allows us to simulate high server load in resiliency testing by selectively introducing delays into server side request processing. We'll be sharing this work with the community in {\small \texttt{https://jira.hpdd.intel.com/browse/LU-6283}}<br />
<br />
Another area we'd like to improve is in our ability to control RPC timeouts. As discussed in \ref{AT}, the worst case RPC timeouts are \verb&2*at_max&. This is because an RPC timeout is the sum of the estimated network latency and estimated service time. The upper and lower bounds of both of these estimates cannot currently be configured separately. If we were able to configure them separately then we might be able to lower the worst-case timeouts further.<br />
<br />
We're also working on infrastructure to remove Lustre's dependence on the ping evictor to maintain client connections. This has two primary benefits. First, we'll be able to eliminate Lustre client pings, which can be a source of jitter and I/O disruption \cite{PingEffects}. Second, we will be able to reclaim resources from failed clients more quickly.<br />
<br />
Lastly, work on this paper has revealed that we should be able to create guidelines for site specific tuning. It is unlikely that we can determine a one size fits all solution for configuring Lustre, so providing guidelines for determining appropriate configuration settings is crucial to ensuring every system has optimal resiliency.<br />
<br />
== Conclusion ==<br />
Historically, Lustre has done a poor job of handling message loss in a graceful manner. A flaw in the Lustre protocol would often result in client eviction whenever certain RPCs were lost, and lost messages could also cause performance degradation. Cray has worked with our support vendor and the Lustre open source community to address the flaw in the Lustre protocol and minimize the performance impact of message loss.<br />
<br />
The LDLM resend enhancement has addressed the flaw in the Lustre protocol, while additional enhancements, such as those made to LNet router failure detection and fast gnilnd reconnects, have helped increase the effectiveness of LDLM resend while also helping to mitigate the performance impact of message loss.<br />
<br />
The tuning recommendations and best practices laid out in this paper work in concert with these new capabilities to achieve improved Lustre and LNet resiliency.<br />
<br />
== References ==<br />
* F. Wang, S. Oral, G. Shipman, O. Drokin, T. Wang, and I. Huang. Understanding Lustre Filesystem Internals. Oak Ridge National Laboratory, National Center for Computational Sciences, Tech. Rep. ORNL/TM-2009/117, 2009.<br />
* P. Braam and A. Tomas. (2008, February 9). The Cascading Timeouts Problem and the Solution [Online]. Available: http://wiki.lustre.org/images/f/f0/Cascading-timeouts-hld.pdf<br />
* Intel. (2015). Lustre Operations Manual [Online]. Available: https://wiki.hpdd.intel.com/display/PUB/Documentation<br />
* Cray. Cray XC Systems S-0041-C [Online]. Available: http://docs.cray.com/books/S-0041-C/S-0041-C.pdf<br />
* C. Spitz, et al. (2012). Minimizing Lustre Ping Effects at Scale on Cray Systems [Paper presented at the Cray User Group conference, Stuttgart, Germany, April 29 - May 3, 2012].<br />
<br />
[[Category:Router]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=LNET_Router&diff=600LNET Router2015-05-12T01:37:02Z<p>Sknolin: Added Router Category</p>
<hr />
<div>== Lustre Network Routers ==<br />
<br />
Lustre Network Router (LNET Router) configuration and use is not extremely well documented.<br />
<br />
http://cdn.opensfs.org/wp-content/uploads/2015/04/Lustre-Network-Router-Config_Fragalla.pdf<br />
<br />
http://www.ssec.wisc.edu/~scottn/Lustre_ZFS_notes/lustre_router_setup.html<br />
<br />
[[Category:Howto]][[Category:Router]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=MDT_Mirroring_with_ZFS_and_SRP&diff=583MDT Mirroring with ZFS and SRP2015-05-06T19:12:44Z<p>Sknolin: </p>
<hr />
<div>Original information provided by Jesse Stroik, January 2014.<br />
<br />
At SSEC we tested mirrored metadata targets for our Lustre file systems. The idea is to use ZFS to mirror storage targets across two different MDS nodes, so the data is always available on both servers without using iSCSI or other technologies. The basis of this idea comes from Charles Taylor's LUG 2012 presentation "High Availability Lustre Using SRP-Mirrored LUNs".<br />
<br />
Instead of LVM, we will use ZFS to provide the mirror. SCST for infiniband RDMA providing the targets and ZFS mirrors performed well in our testing. We did not have a chance to test more thoroughly for production.<br />
<br />
Below are notes from our investigation. The system worked very well in our brief testing, but was not tested enough to put in production.<br />
<br />
== Terminology ==<br />
* '''Target''' - The device to which data will be written. Usually it controls a group of LUNs (think OSS, not OST or individual disk).<br />
* '''Initiator''' - The system or device attempting to access the target. Client system in our case.<br />
== Protocols ==<br />
SRP - SCSI RDMA Protocol.<br />
<br />
Despite its name, this can be implemented without RDMA. As we would likely implement it, it is a protocol used to communicate with SCSI devices directly over RDMA.<br />
<br />
iSER - iSCSI Extensions for RDMA<br />
<br />
A layer of abstraction on the iSCSI protocol implemented by a "Datamover Architecture" and with RDMA support. The basic idea is simple: RDMA allows devices to reach each other's memory directly. When an initiator begins an unsolicited write, the target uses the protocol to read the data from the initiator directly while writing to itself. So the target effectively goes and reads the data off of the initiator.<br />
<br />
== SRP Implementations ==<br />
# '''LIO''' - SRP implementation by Datera, a SV startup from 2011.<br />
<br />
It appears that Datera got this thing into the Linux kernel, but deployment and usage documentation is nonexistent or very hard to actually find.<br />
<br />
TargetCLI is a python CLI management interface for the targets. <br />
<br />
# '''SCST''' - SCSI Target Framework (kernel - not in official tree).<br />
<br />
This framework includes a few components:<br />
<br />
* Core/Engine software<br />
* '''Target "Drivers"''' - I put drivers in quotes because this part is implemented as a kernel module and they call it a driver, but it is software that controls the Target (think OSS) and doesn't really provide a hardware driver as far as I understand.<br />
* '''Storage Drivers''' - This is the part that implements the SCSI commands on the LUN (in our case, attached OSTs).<br />
<br />
We will likely need to compile/link the target and storage drivers against a kernel version, and install only with that kernel version. We already link kernel versions with Lustre, so this may not be unreasonable.<br />
<br />
== iSER implementations ==<br />
<br />
* LIO - See information in SRP implementation. This also implements iSER.<br />
* STGT - SCSI Target Framework (userspace)<br />
** This doesn't perform as well as SCST according to the research. It's considered obsolete, but I included the definition because it may be mentioned in a lot of documentation.<br />
<br />
<br />
== Technologies Available w/ Summary ==<br />
<br />
# TGTD/iSER via scsi-target-utils<br />
# LIO<br />
# SCST <br />
# Snapshots <br />
<br />
=== LIO ===<br />
<br />
This approach appears to have several disadvantages:<br />
<br />
* It appears to not be 'safe' for writes; you'd have to disable all caching, and they want you to use their OS.<br />
* As far as we could tell, it's effectively not open. We probably cannot implement it without their OS and their support.<br />
* It's behind SCST.<br />
<br />
=== SCST ===<br />
<br />
We tested this. We had to grab their source, compile and link against our kernel, and install. This may be a temporary issue if we need to link against a newer kernel than Lustre is currently available against, but Lustre is getting accepted into the kernel as well, so we can plan for this to be our future technology if we cannot use it now.<br />
<br />
== Installing the SCST ==<br />
<br />
NOTE BEFORE IMPLEMENTING: THIS APPEARS TO USE 128KB INODE SIZES WHEN EXPORTED VIA ZPOOLS<br />
<br />
This requires the OFA OFED stack and links against it.<br />
<br />
# Download and install the SCST package and scstadmin: http://sourceforge.net/projects/scst/files/<br />
<br />
# Extract the SCST package, and verify your Makefile lines if necessary:<br />
## export KDIR=/usr/src/kernels/2.6.32-431.el6.x86_64/<br />
## export KVER=2.6.32-431.el6.x86_64<br />
<br />
# Build and install SRPT and SCST<br />
<br />
## make scst && make scst_install<br />
## make srpt && make srpt_install<br />
## Then load the modules into the kernel, and set it up to start on boot.<br />
<br />
## /usr/lib/lsb/install_initd scst<br />
## chkconfig --add scst<br />
## modprobe scst<br />
## modprobe ib_srpt<br />
## modprobe scst_vdisk<br />
<br />
=== Setting up the Devices ===<br />
<br />
NOTE: we ended up using HW RAID on the bottom, exported that via SRP, and ZFS mirrored at the top level. This very first step can be skipped<br />
<br />
If you are using ZFS, you need to create a logical volume. For example, let's say we have the pool shps-meta which is comprised of some disks.<br />
<br />
## zfs create -V 300G shps-meta/MDT<br />
## zfs set canmount=off shps-meta<br />
<br />
### Now you have the device /dev/zvol/shps-meta/MDT<br />
<br />
<br />
<br />
On each system, once you have your LUN prepared (with the RAID controller or zpool) it's time to register that device:<br />
<br />
scstadmin -open_dev MDT1 -handler vdisk_blockio -attributes filename=/dev/zvol/shps-meta/MDT<br />
<br />
Then list the device and target:<br />
<br />
scstadmin -list_device<br />
scstadmin -list_target<br />
ls -l /sys/kernel/scst_tgt/devices<br />
<br />
You should get some info from each: the MDT1 dev you just created, and also ib_srpt_target0. If you don't get that, reload the ib_srpt module. <br />
<br />
Define a security group (the hosts that can write):<br />
<br />
scstadmin --add_group MDS -driver ib_srpt -target ib_srpt_target_0<br />
scstadmin -list_group<br />
<br />
Add initiators to the group: <br />
<br />
(for testing, leave this open)<br />
<br />
<br />
Assign the LUNs to the target <br />
scstadmin -add_lun 0 -driver ib_srpt -target ib_srpt_target_0 -group MDS -device MDT1<br />
<br />
Now enable the target:<br />
<br />
<br />
scstadmin -enable_target ib_srpt_target_0 -driver ib_srpt<br />
<br />
And enable the driver:<br />
scstadmin -set_drv_attr ib_srpt -attributes enabled=1<br />
<br />
Modprobe modifications to pass the driver. This example is access over one-target-per-HCA-port<br />
<br />
cat /etc/modprobe.d/ib_srpt.conf<br />
options ib_srpt one_target_per_port=1<br />
<br />
Set up permissions for the LUN (necessary)<br />
<br />
scstadmin -add_init '*' -driver ib_srpt -target ib_srpt_target_0 -group MDS<br />
<br />
=== Initiator setup ===<br />
<br />
<br />
On the target, ensure that this initiator has permission to access the disk:<br />
<br />
<br />
First, load the module ib_srp:<br />
<br />
modprobe ib_srp<br />
<br />
Note: This module is part of OFED. OFED also includes the ib_srpt (target) module which is used to host the FS.<br />
<br />
<br />
<br />
Now, search for the available targets:<br />
<br />
srp_daemon -oacd /dev/infiniband/umad0<br />
Note: There could be multiple /dev/infiniband/umad devices. (umad bro?)<br />
<br />
<br />
Add the target:<br />
#scstadmin -add_target <br />
<br />
<br />
<br />
<br />
find /sys -iname add_target -print<br />
echo "id_ext=0002c90300b77f40,ioc_guid=0002c90300b77f40, dgid=fe800000000000000002c90300b77f41,pkey=ffff, \ service_id=0002c90300b77f40" > /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0/infiniband_srp/srp-mlx4_0-1/add_target<br />
<br />
Note: in the above example, the part echoed is the result of the previous srp_daemon command (there may be multiple devices to add this way), and the redirection is into the result of the find command.<br />
<br />
<br />
<br />
Be sure to write the config:<br />
<br />
<br />
scstadmin -write_config /etc/scst.conf<br />
<br />
And then ensure the startup script is in chkconfig:<br />
<br />
chkconfig --list scst<br />
chkconfig --list srpd<br />
chkconfig --list rdma<br />
/etc/rdma/rdma.conf must contain the line:<br />
<br />
SRP_LOAD=yes<br />
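<br />
The notes above stop short of showing the top-level mirror itself. A minimal sketch, assuming the local HW RAID LUN and the SRP-attached LUN from the partner MDS appear as the placeholder device names used below:<br />
<br />
<pre><br />
# Build the MDT pool as a ZFS mirror of the local LUN and the remote SRP LUN.<br />
zpool create mdt-pool mirror /dev/mapper/local-mdt-lun /dev/mapper/remote-srp-lun<br />
zpool status mdt-pool    # verify both halves of the mirror are ONLINE<br />
</pre><br />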
<br />
== Metadata Backup ==<br />
<br />
=== Snapshots ===<br />
<br />
[[ZFS Snapshots for MDT backup]]<br />
<br />
This is our backup. We can just use ZFS features to keep the metadata reasonably in sync. This won't be perfectly up to date, sadly, but a sync from an hour or two ago means an hour or two of data may have been lost, which is very acceptable in many cases. <br />
<br />
We always backup the metadata also. This is necessary even if we have a mirrored MDT.<br />
<br />
[[Category: ZFS]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=MDT_Mirroring_with_ZFS_and_SRP&diff=582MDT Mirroring with ZFS and SRP2015-05-06T19:12:11Z<p>Sknolin: </p>
<hr />
<div>Original information provided by Jesse Stroik, January 2014.<br />
<br />
At SSEC we tested mirrored meta data targets for our Lustre file systems. The idea is to use ZFS to mirror storage targets in 2 different MDS - so the data is always available on both servers without using iscsi or other technologies. The basis of this idea comes from Charles Taylor's LUG 2012 presentation "High Availability Lustre Using SRP-Mirrored LUNs"<br />
<br />
Instead of LVM, we will use ZFS to provide the mirror. SCST for infiniband RDMA providing the targets and ZFS mirrors performed well in our testing. We did not have a chance to test more thoroughly for production.<br />
<br />
Below are notes from our investigation. The system worked very well in our brief testing, but was not tested enough to put in production.<br />
<br />
== Terminology ==<br />
* '''Target''' - The device to which data will be written. Usually it controls a group of LUNs (think OSS, not OST or individual disk).<br />
* '''Initiator''' - The system or device attempting to access the target. Client system in our case.<br />
== Protocols ==<br />
SRP - SCSI RDMA Protocol.<br />
<br />
Despite its name, this can be implemented w/o RMDA. As we would likely implement, it is a protocol used to communicate with SCSI devices directly over RMDA.<br />
<br />
iSER - iSCSI Extensions for RDMA<br />
<br />
A layer of abstraction on the iSCSI protocol implemented by a "Datamover Architecture" and with RDMA support. The basic idea is simple: RMDA allows devices to reach each other's memory directly. When an initiator beings an unsolicited write, the disk uses the protocol to read the data from the initiator directly while writing to itself. So the target effectively goes and reads the data off of the initiator.<br />
<br />
== SRP Implementations ==<br />
# '''LIO''' - SRP implementation by Datera, a SV startup from 2011.<br />
<br />
It appears that Datera got this thing into the Linux kernel, but deployment and usage documentation is nonexistent or very hard to actually find.<br />
<br />
TargetCLI is a python CLI management interface for the targets. <br />
<br />
# '''SCST''' - SCSI Target Framework (kernel - not in official tree).<br />
<br />
This framework includes a few components:<br />
<br />
* Core/Engine software<br />
'''Target "Drivers"''' - I put drivers in quotes because this part is implemented as a kernel module and they call it a driver, but it is software that controls the Target (think OSS) and doesn't really provide a hardware driver as far as I understand.<br />
* '''Storage Drivers''' - This is the part that implements the SCSI commands on the LUN (in our case, attached OSTs).<br />
<br />
We will likely need to compile/link the target and storage drivers against a kernel version, and install only with that kernel version. We already link kernel versions with Lustre, so this may not be unreasonable.<br />
<br />
== iSER implementations ==<br />
<br />
* LIO - See information in SRP implementation. This also implements iSER.<br />
* STGT - SCSI Target Framework (userspace)<br />
** This doesn't perform as well as SCST according to the research. It's considered obsolete, but I included the definition because it may be mentioned in a lot of documentation.<br />
<br />
<br />
== Technologies Available w/ Summary ==<br />
<br />
# TGTD/ISIR vis scsi-target utils<br />
# LIO<br />
# SCST <br />
# Snapshots <br />
<br />
=== LIO ===<br />
<br />
This seems like it has disadvantages. <br />
<br />
it appears to not be 'safe' for writes. You'd have to disable all caching, they want you to use their OS<br />
As far as we could tell, it's effectively not open. We probably cannot implement without their OS and their support.<br />
It's behind SCST.<br />
<br />
=== SCST ===<br />
<br />
We tested this. We had to grab their source, compile and link against our kernel, and install. This may be a temporary issue if we need to link against a newer kernel than Lustre is currently available against, but Lustre is getting accepted into the kernel as well, so we can plan for this to be our future technology if we cannot use it now.<br />
<br />
== Installing the SCST ==<br />
<br />
NOTE BEFORE IMPLEMENTING: THIS APPEARS TO USE 128KB INODE SIZES WHEN EXPORTED VIA ZPOOLS<br />
<br />
This requires the OFA OFED stack and links against it.<br />
<br />
# Download and install the SCST package and scstadmin: http://sourceforge.net/projects/scst/files/<br />
<br />
# Extract the SCST package, and verify your Makefile lines if necessary:<br />
## export KDIR=/usr/src/kernels/2.6.32-431.el6.x86_64/ export KVER=2.6.32-431.el6.x86_64<br />
<br />
# Build and install SRPT and SCST<br />
<br />
## make scst && make scst_install make srpt && make srpt_install<br />
## Then load the modules into the kernel, and set it up to start on boot.<br />
<br />
## /usr/lib/lsb/install_initd scst chkconfig --add scst<br />
## modprobe scst modprobe ib_srpt modprobe scst_vdisk<br />
<br />
=== Setting up the Devices ===<br />
<br />
NOTE: we ended up using HW RAID on the bottom, exported that via SRP, and ZFS mirrored at the top level. This very first step can be skipped<br />
<br />
If you are using ZFS, you need to create a logical volume. For example, let's say we have the pool shps-meta which is comprised of some disks.<br />
<br />
## zfs create -V 300G shps-meta/MDT zfs set canmount=off shps-meta <br />
<br />
### Now you have the device /dev/zvol/shps-meta/MDT<br />
<br />
<br />
<br />
On each system, once you have your LUN prepared (with the RAID controller or zpool) it's time to register that device:<br />
<br />
scstadmin -open_dev MDT1 -handler vdisk_blockio -attributes filename=/dev/zvol/shps-meta/MDT<br />
<br />
Then list the device and target:<br />
<br />
scstadmin -list_device scstadmin -list_target ls -l /sys/kernel/scst_tgt/devices<br />
<br />
You should get some info from each: the MDT1 dev you just created, and also ib_srpt_target0. If you don't get that, reload the ib_srpt module. <br />
<br />
Define a security group (the hosts that can write):<br />
<br />
scstadmin --add_group MDS -driver ib_srpt -target ib_srpt_target_0 scstadmin -list_group<br />
<br />
Add initiators to the group: <br />
<br />
(for testing, leave this open)<br />
<br />
<br />
Assign the LUNs to the target <br />
scstadmin -add_lun 0 -driver ib_srpt -target ib_srpt_target_0 -group MDS -device MDT1<br />
<br />
Now enable the target:<br />
<br />
<br />
scstadmin -enable_target ib_srpt_target_0 -driver ib_srpt<br />
<br />
And enable the driver:<br />
scstadmin -set_drv_attr ib_srpt -attributes enabled=1<br />
<br />
Modprobe options for the driver. This example exposes one target per HCA port; /etc/modprobe.d/ib_srpt.conf should contain:<br />
<br />
options ib_srpt one_target_per_port=1<br />
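<br />
One way to create that file non-interactively (same option as above):<br />
<br />
<pre><br />
echo "options ib_srpt one_target_per_port=1" > /etc/modprobe.d/ib_srpt.conf<br />
</pre><br />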
<br />
Set up permissions for the LUN (this step is necessary):<br />
<br />
scstadmin -add_init '*' -driver ib_srpt -target ib_srpt_target_0 -group MDS<br />
<br />
=== Initiator setup ===<br />
<br />
<br />
On the target, first ensure that this initiator has permission to access the disk (see the group and LUN permission steps above).<br />
<br />
<br />
First, load the module ib_srp:<br />
<br />
modprobe ib_srp<br />
<br />
Note: This module is part of OFED. OFED also includes the ib_srpt (target) module which is used to host the FS.<br />
<br />
<br />
<br />
Now, search for the available targets:<br />
<br />
srp_daemon -oacd /dev/infiniband/umad0<br />
Note: There could be multiple /dev/infiniband/umad devices. (umad bro?)<br />
<br />
<br />
Add the target:<br />
#scstadmin -add_target <br />
<br />
<br />
<br />
<br />
find /sys -iname add_target -print<br />
echo "id_ext=0002c90300b77f40,ioc_guid=0002c90300b77f40,dgid=fe800000000000000002c90300b77f41,pkey=ffff,service_id=0002c90300b77f40" > /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0/infiniband_srp/srp-mlx4_0-1/add_target<br />
<br />
Note: in the above example, the string being echoed is the output of the previous srp_daemon command (there may be multiple devices to add this way), and the redirection target is the add_target path found by the find command.<br />
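<br />
If srp_daemon reports several targets, a small loop can feed each discovery line into add_target. This is only a sketch; it assumes a single SRP port (the first add_target file found) and the umad0 device used above:<br />
<br />
<pre><br />
# Write every target spec reported by srp_daemon into the SRP initiator's add_target file<br />
ADD_TARGET=$(find /sys -iname add_target -print | head -n 1)<br />
srp_daemon -oacd /dev/infiniband/umad0 | while read -r target_spec; do<br />
    echo "${target_spec}" > "${ADD_TARGET}"<br />
done<br />
</pre><br />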
<br />
<br />
<br />
Be sure to write the config:<br />
<br />
<br />
scstadmin -write_config /etc/scst.conf<br />
<br />
And then ensure the startup script is in chkconfig:<br />
<br />
chkconfig --list scst<br />
chkconfig --list srpd<br />
chkconfig --list rdma<br />
/etc/rdma/rdma.conf must contain the line:<br />
<br />
SRP_LOAD=yes<br />
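<br />
A quick way to check and, if needed, set that (assuming the stock RHEL 6 rdma.conf layout, where the variable already exists and defaults to no):<br />
<br />
<pre><br />
grep '^SRP_LOAD=' /etc/rdma/rdma.conf<br />
sed -i 's/^SRP_LOAD=no/SRP_LOAD=yes/' /etc/rdma/rdma.conf<br />
</pre><br />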
<br />
== Metadata Backup ==<br />
<br />
=== Snapshots ===<br />
<br />
[[ZFS Snapshots for MDT Backup]]<br />
<br />
This is our backup. We can just use ZFS features to keep the metadata reasonably in sync. This won't be perfectly up to date, sadly, but a sync from an hour or two ago means at most an hour or two of metadata changes may be lost, which is acceptable in many cases.<br />
<br />
We always back up the metadata as well. This is necessary even if we have a mirrored MDT.<br />
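<br />
As a rough illustration of the idea (the dataset name follows the shps-meta/MDT example above; the hourly schedule and lack of pruning are assumptions, not our exact setup), a snapshot script run from cron could look like:<br />
<br />
<pre><br />
#!/bin/bash<br />
# e.g. /etc/cron.hourly/mdt-snapshot: take a timestamped snapshot of the MDT zvol<br />
zfs snapshot shps-meta/MDT@hourly-$(date +%Y%m%d%H)<br />
# List existing snapshots, oldest first, so stale ones can be pruned by hand<br />
zfs list -t snapshot -o name,creation -s creation -r shps-meta<br />
</pre><br />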
<br />
[[Category: ZFS]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=ZFS_Snapshots_for_MDT_backup&diff=581ZFS Snapshots for MDT backup2015-05-06T19:08:05Z<p>Sknolin: </p>
<hr />
<div>At SSEC we are reviewing our MDT snapshot documentation. The basics worked, but we had some issues that should be documented.<br />
<br />
In the meantime our old notes are in the comments here: https://jira.hpdd.intel.com/browse/LUDOC-161<br />
<br />
Please add any notes you may have on ZFS snapshots for MDT backup and recovery.<br />
<br />
[[Category:ZFS]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=Lustre_with_ZFS_Install&diff=574Lustre with ZFS Install2015-05-04T21:57:05Z<p>Sknolin: /* Helpful links */</p>
<hr />
<div>==Introduction==<br />
<br />
This page is an attempt to provide some information on how to install Lustre with a ZFS backend. You are encouraged to add your own version, either as a separate section or by editing this page into a general guide.<br />
<br />
===Helpful links===<br />
<br />
* http://zfsonlinux.org/lustre-configure-single.html<br />
* https://github.com/chaos/lustre/commit/04a38ba7 - ZFS and HA<br />
<br />
== SSEC Example ==<br />
<br />
This version applies to systems with JBODs where ZFS manages the disks directly, without a Dell RAID controller in between. This guide is very specific to a single installation at UW SSEC: versions have changed, and we use Puppet to provide various software packages and configurations. However, it is included because some information may be useful to others.<br />
<br />
# Lustre Server Prep Work<br />
## OS Installation (RHEL6)<br />
### You must use the RHEL/Centos 6.4 Kernel 2.6.32-358<br />
### Use the "lustre" kickstart option which installs a 6.4 kernel<br />
### Define the host in puppet so that it is not a default host - NOTE: We Use Puppet at SSEC to distribute various required packages, other environments will vary!<br />
##Lustre 2.4 installation <br />
### Puppet Modules Needed<br />
* zfs-repo<br />
* lustre-healthcheck<br />
* ib-mellanox<br />
* check_mk_agent-ssec<br />
* puppetConfigFile<br />
* lustre-shutdown<br />
* nagios_plugins<br />
* lustre24-server-zfs<br />
* selinux-disable<br />
<br />
#Configure Metadata Controller<br />
##Map metadata drives to enclosures (with scripts to help)<br />
##For our example MDS system we made aliases for 'ssd0', 'ssd1', 'ssd2' and 'ssd3' (a consolidated command sketch follows this list)<br />
###put these in /etc/zfs/vdev_id.conf - for example:<br />
###alias arch03e07s6 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50056b69199-lun-0 <br />
##run udevadm trigger to load drive aliases<br />
##On metadata controller, run mkfs.lustre to create metadata partition. On our example system:<br />
###Use separate MGS for multiple filesystems on same metadata server.<br />
###Separate MGS: mkfs.lustre --mgs --backfstype=zfs lustre-meta/mgs mirror d2 d3 mirror d4 d5<br />
###Separate MDT: mkfs.lustre --fsname=arcdata1 --mdt --mgsnode=172.16.23.14@o2ib --backfstype=zfs lustre-meta/arcdata1-meta<br />
###Create /etc/ldev.conf and add the metadata partition. On example system, we added:<br />
####geoarc-2-15 - MGS zfs:lustre-meta/mgs<br />
####geoarc-2-15 - arcdata-MDT0000 zfs:lustre-meta/arcdata-meta<br />
###Create /etc/modprobe.d/lustre.conf<br />
####options lnet networks="o2ib" routes="tcp metadataip@o2ib0 172.16.24.[220-229]@o2ib0"<br />
####NOTE: if you do not want routing, or if you are having trouble with setup, the simple options lnet networks="o2ib" is fine<br />
##Start Lustre. If you have multiple metadata mounts, you can just run service lustre start.<br />
##Add the lnet service to chkconfig and make sure it starts on boot. We may want to leave the lustre service off at startup for metadata controllers.<br />
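<br />
As referenced above, here is a consolidated sketch of the metadata-server steps, using only the example values from this list (the alias, pool layout, host name and NID are the example's and will differ on your systems; the routing option is omitted for simplicity):<br />
<br />
<pre><br />
# Map a drive alias, then reload udev so /dev/disk/by-vdev/* appears<br />
cat >> /etc/zfs/vdev_id.conf <<'EOF'<br />
alias arch03e07s6 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50056b69199-lun-0<br />
EOF<br />
udevadm trigger<br />
<br />
# Separate MGS and MDT on the same metadata server<br />
mkfs.lustre --mgs --backfstype=zfs lustre-meta/mgs mirror d2 d3 mirror d4 d5<br />
mkfs.lustre --fsname=arcdata1 --mdt --mgsnode=172.16.23.14@o2ib --backfstype=zfs lustre-meta/arcdata1-meta<br />
<br />
# Tell the init script which datasets to mount, configure LNet, and start Lustre<br />
cat >> /etc/ldev.conf <<'EOF'<br />
geoarc-2-15 - MGS zfs:lustre-meta/mgs<br />
geoarc-2-15 - arcdata-MDT0000 zfs:lustre-meta/arcdata-meta<br />
EOF<br />
echo 'options lnet networks="o2ib"' > /etc/modprobe.d/lustre.conf<br />
service lustre start<br />
chkconfig --add lnet<br />
chkconfig lnet on<br />
</pre><br />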
<br />
#Configure OSTs<br />
##Map drives to enclosures (with scripts to help!)<br />
##Run udevadm trigger to load drive aliases.<br />
##mkfs.lustre on MD1200s. <br />
###Example RAIDZ2 on one MD1200: mkfs.lustre --fsname=cove --ost --backfstype=zfs --index=0 --mgsnode=172.16.24.12@o2ib lustre-ost0/ost0 raidz2 e17s0 e17s1 e17s2 e17s3 e17s4 e17s5 e17s6 e17s7 e17s8 e17s9 e17s10 e17s11<br />
###Example RAIDZ2 with 2 disks from each enclosure, 5 enclosures (our cove test example): mkfs.lustre --fsname=cove --ost --backfstype=zfs --index=0 --mgsnode=172.16.24.12@o2ib lustre-ost0/ost0 raidz2 e13s0 e13s1 e15s0 e15s1 e17s0 e17s1 e19s0 e19s1 e21s0 e21s1<br />
##Repeat as necessary for additional enclosures.<br />
##Create /etc/ldev.conf<br />
###Example on lustre2-8-11:<br />
###lustre2-8-11 - cove-OST0000 zfs:lustre-ost0/ost0<br />
###lustre2-8-11 - cove-OST0001 zfs:lustre-ost1/ost1<br />
###lustre2-8-11 - cove-OST0002 zfs:lustre-ost2/ost2<br />
##Start OSTs. Example: service lustre start. Repeat as necessary for additional enclosures.<br />
##Add the services to chkconfig and set them to start on boot.<br />
#Configure backup metadata controller (future)<br />
#Mount the Lustre file system on clients<br />
##Add entry to /etc/fstab. With our example system, our fstab entry is:<br />
###172.16.24.12@o2ib:/cove /cove lustre defaults,_netdev,user_xattr 0 0<br />
##Create an empty folder for the mountpoint, and mount the file system (e.g., mkdir /cove; mount /cove). A consolidated sketch of the OST and client steps follows this list.<br />
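<br />
As referenced above, a consolidated sketch of the OST-side and client-side commands, again using only the example values from this list (host names, pool/dataset names, drive aliases and NIDs are the example's, not a recommendation):<br />
<br />
<pre><br />
# On the OSS: one RAIDZ2 OST across a single MD1200, register it in ldev.conf, start Lustre<br />
udevadm trigger<br />
mkfs.lustre --fsname=cove --ost --backfstype=zfs --index=0 --mgsnode=172.16.24.12@o2ib \<br />
    lustre-ost0/ost0 raidz2 e17s0 e17s1 e17s2 e17s3 e17s4 e17s5 e17s6 e17s7 e17s8 e17s9 e17s10 e17s11<br />
echo 'lustre2-8-11 - cove-OST0000 zfs:lustre-ost0/ost0' >> /etc/ldev.conf<br />
service lustre start<br />
<br />
# On a client: add the fstab entry, create the mountpoint, and mount<br />
echo '172.16.24.12@o2ib:/cove /cove lustre defaults,_netdev,user_xattr 0 0' >> /etc/fstab<br />
mkdir /cove<br />
mount /cove<br />
</pre><br />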
<br />
<br />
__FORCETOC__<br />
[[Category:Howto]][[Category:ZFS]][[Category:NeedsContributions]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=Category:ZFS&diff=571Category:ZFS2015-05-01T02:10:14Z<p>Sknolin: Created page with "'''Category:ZFS''' contains pages related to ZFS and Lustre. Include the tag <pre>Category:ZFS</pre> at the bottom of your page to include it in this category."</p>
<hr />
<div>'''Category:ZFS''' contains pages related to ZFS and Lustre. <br />
<br />
Include the tag <pre>[[Category:ZFS]]</pre> at the bottom of your page to include it in this category.</div>Sknolinhttp://wiki.lustre.org/index.php?title=ZFS_Compression&diff=570ZFS Compression2015-05-01T02:09:00Z<p>Sknolin: </p>
<hr />
<div>Notes on enabling and working with ZFS compression.<br />
<br />
We assume this is for Lustre on a ZFS backend, but it will work the same for standalone ZFS. On a Lustre/ZFS system, you have to perform these commands on every OST in the system. I would definitely not bother doing this on metadata.<br />
<br />
'''NOTE''' - here I'll assume you always work on a whole pool or filesystem, but these commands should work on any arbitrary directory too.<br />
<br />
<STORAGE> = the pool, filesystem, or directory you choose.<br />
<br />
== Check ZFS compression ==<br />
<br />
<pre>zfs get compression <STORAGE></pre><br />
<br />
== Set ZFS compression ==<br />
<br />
<pre>zfs set compression=on <STORAGE></pre><br />
<br />
This turns on compression with the default algorithm (lzjb). You can choose values other than "on" for the compression algorithm; see 'man zfs' for details.<br />
<br />
To use a different algorithm, use for example "compression=lz4" instead of "on". <br />
<br />
'''lz4''' is likely the best choice now. It has been tested extensively, and provides very good compression balanced with performance. Basically, it stops trying to compress after compressing some initial part of the data and getting poor results. ''Details from experts on this topic are needed''<br />
<br />
I think you might only need to set this on pools, and filesystems should inherit, but to be safe, you can apply it to everything.<br />
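<br />
For example, a small sketch that sets lz4 on every imported pool on an OSS and then shows the inherited values (this assumes you really do want compression everywhere on that server):<br />
<br />
<pre><br />
# Set lz4 at the top of each pool; child datasets inherit unless overridden<br />
for pool in $(zpool list -H -o name); do<br />
    zfs set compression=lz4 "${pool}"<br />
    # Show the property and where each dataset inherits it from<br />
    zfs get -r -o name,value,source compression "${pool}"<br />
done<br />
</pre><br />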
<br />
== Check compression ratio ==<br />
<br />
On OSS: <br />
<pre>zfs get compressratio <STORAGE></pre><br />
<br />
On client machines:<br />
Keep in mind that fastidious users will likely notice their data apparently shrank if they move things around to a compressed filesystem. "du --apparent-size" on some files or directories can help show what is happening.<br />
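<br />
A quick illustration of the difference on some directory (the path here is hypothetical):<br />
<br />
<pre><br />
# Space actually allocated on disk (after compression)<br />
du -sh /lustre/somedir<br />
# Logical size of the files, ignoring compression<br />
du -sh --apparent-size /lustre/somedir<br />
</pre><br />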
<br />
[[Category:ZFS]][[Category:Howto]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=ZFS_JBOD_Monitoring&diff=563ZFS JBOD Monitoring2015-05-01T01:45:30Z<p>Sknolin: </p>
<hr />
<div>If you are using ZFS software RAID (RAIDZ2, for example) to provide Lustre OSTs, monitoring disk and enclosure health can be a challenge. This is because vendor disk-array monitoring is typically bundled with the RAID controllers.<br />
<br />
If you are aware of any vendor-supported monitoring solutions for this or have your own solution, please add to this page.<br />
<br />
== UW SSEC Solution ==<br />
<br />
This information is used for linux systems and monitoring Dell MD1200 disk arrays directly attached via SAS, with no RAID card.<br />
<br />
=== Disk Failure: zpool status ===<br />
<br />
To detect disk failures, simply check the zpool status. There are various scripts to do this for Nagios/check_mk; a minimal example follows.<br />
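<br />
A minimal check_mk-style local check along these lines (just a sketch; it only distinguishes "all pools healthy" from anything else):<br />
<br />
<pre><br />
#!/bin/bash<br />
# check_mk local check: report OK (0) when all pools are healthy, CRITICAL (2) otherwise<br />
if zpool status -x | grep -q '^all pools are healthy$'; then<br />
    echo "0 ZFS_Pools - all pools are healthy"<br />
else<br />
    echo "2 ZFS_Pools - $(zpool status -x | head -n 1)"<br />
fi<br />
</pre><br />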
<br />
=== Predictive Failure: smartctl ===<br />
<br />
To monitor for predicted drive failures, we use 'smartctl', provided by the 'smartmontools' package on CentOS.<br />
<br />
Example check_mk script:<br />
<pre><br />
#!/bin/bash<br />
# check_mk local check: emit one SMART status line per vdev-aliased disk<br />
<br />
# Enumerate the drive aliases from /etc/zfs/vdev_id.conf (skip partition entries)<br />
DISKS="$(/bin/ls /dev/disk/by-vdev | /bin/grep -v part)"<br />
<br />
for DISK in ${DISKS}<br />
do<br />
    # For these SAS drives, smartctl -H prints e.g. "SMART Health Status: OK"; field 4 is the status<br />
    HEALTH=`smartctl -H /dev/disk/by-vdev/${DISK} | grep SMART`<br />
    HEALTHSTATUS=`echo ${HEALTH} | cut -d ' ' -f 4`<br />
    if [[ $HEALTHSTATUS != "OK" ]]; then<br />
        status=2<br />
    else<br />
        status=0<br />
    fi<br />
    # check_mk local check format: status, item name, then details<br />
    echo "$status SMART_Status_${DISK} - ${DISK} ${HEALTH}"<br />
done<br />
</pre><br />
<br />
=== Enclosure Monitoring ===<br />
<br />
While the above techniques tell you if you have a disk problem, you still need to monitor the status of the arrays themselves. For our particular case this means MD1200 disk arrays attached via SAS. For us, sg3_utils and its sg_ses tool are the best answer so far.<br />
<br />
To monitor our enclosures we use this script: http://www.ssec.wisc.edu/~scottn/Lustre_ZFS_notes/script/check_md1200.pl<br />
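<br />
For a manual look at an enclosure, something along these lines works (a sketch; the sg device number of the enclosure processor will differ per system):<br />
<br />
<pre><br />
# Find the SES enclosure devices, then dump an enclosure status page for one of them<br />
lsscsi -g | grep -i enclosu<br />
sg_ses --page=2 /dev/sg24    # page 0x02 = Enclosure Status diagnostic page<br />
</pre><br />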
<br />
[[Category:Monitoring]][[Category:ZFS]][[Category:Howto]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=MDT_Mirroring_with_ZFS_and_SRP&diff=562MDT Mirroring with ZFS and SRP2015-05-01T01:44:23Z<p>Sknolin: </p>
<hr />
<div>Original information provided by Jesse Stroik, January 2014.<br />
<br />
At SSEC we tested mirrored meta data targets for our Lustre file systems. The idea is to use ZFS to mirror storage targets in 2 different MDS - so the data is always available on both servers without using iscsi or other technologies. The basis of this idea comes from Charles Taylor's LUG 2012 presentation "High Availability Lustre Using SRP-Mirrored LUNs"<br />
<br />
Instead of LVM, we will use ZFS to provide the mirror. SCST for infiniband RDMA providing the targets and ZFS mirrors performed well in our testing. We did not have a chance to test more thoroughly for production.<br />
<br />
Below are notes from our investigation. The system worked very well in our brief testing, but was not tested enough to put in production.<br />
<br />
== Terminology ==<br />
* '''Target''' - The device to which data will be written. Usually it controls a group of LUNs (think OSS, not OST or individual disk).<br />
* '''Initiator''' - The system or device attempting to access the target. Client system in our case.<br />
== Protocols ==<br />
SRP - SCSI RDMA Protocol.<br />
<br />
Despite its name, this can be implemented w/o RMDA. As we would likely implement, it is a protocol used to communicate with SCSI devices directly over RMDA.<br />
<br />
iSER - iSCSI Extensions for RDMA<br />
<br />
A layer of abstraction on the iSCSI protocol implemented by a "Datamover Architecture" and with RDMA support. The basic idea is simple: RMDA allows devices to reach each other's memory directly. When an initiator beings an unsolicited write, the disk uses the protocol to read the data from the initiator directly while writing to itself. So the target effectively goes and reads the data off of the initiator.<br />
<br />
== SRP Implementations ==<br />
# '''LIO''' - SRP implementation by Datera, a SV startup from 2011.<br />
<br />
It appears that Datera got this thing into the Linux kernel, but deployment and usage documentation is nonexistent or very hard to actually find.<br />
<br />
TargetCLI is a python CLI management interface for the targets. <br />
<br />
# '''SCST''' - SCSI Target Framework (kernel - not in official tree).<br />
<br />
This framework includes a few components:<br />
<br />
* Core/Engine software<br />
'''Target "Drivers"''' - I put drivers in quotes because this part is implemented as a kernel module and they call it a driver, but it is software that controls the Target (think OSS) and doesn't really provide a hardware driver as far as I understand.<br />
* '''Storage Drivers''' - This is the part that implements the SCSI commands on the LUN (in our case, attached OSTs).<br />
<br />
We will likely need to compile/link the target and storage drivers against a kernel version, and install only with that kernel version. We already link kernel versions with Lustre, so this may not be unreasonable.<br />
<br />
== iSER implementations ==<br />
<br />
* LIO - See information in SRP implementation. This also implements iSER.<br />
* STGT - SCSI Target Framework (userspace)<br />
** This doesn't perform as well as SCST according to the research. It's considered obsolete, but I included the definition because it may be mentioned in a lot of documentation.<br />
<br />
<br />
== Technologies Available w/ Summary ==<br />
<br />
# TGTD/ISIR vis scsi-target utils<br />
# LIO<br />
# SCST <br />
# Snapshots <br />
<br />
=== LIO ===<br />
<br />
This seems like it has disadvantages. <br />
<br />
it appears to not be 'safe' for writes. You'd have to disable all caching, they want you to use their OS<br />
As far as we could tell, it's effectively not open. We probably cannot implement without their OS and their support.<br />
It's behind SCST.<br />
<br />
=== SCST ===<br />
<br />
We tested this. We had to grab their source, compile and link against our kernel, and install. This may be a temporary issue if we need to link against a newer kernel than Lustre is currently available against, but Lustre is getting accepted into the kernel as well, so we can plan for this to be our future technology if we cannot use it now.<br />
<br />
== Installing the SCST ==<br />
<br />
NOTE BEFORE IMPLEMENTING: THIS APPEARS TO USE 128KB INODE SIZES WHEN EXPORTED VIA ZPOOLS<br />
<br />
This requires the OFA OFED stack and links against it.<br />
<br />
# Download and install the SCST package and scstadmin: http://sourceforge.net/projects/scst/files/<br />
<br />
# Extract the SCST package, and verify your Makefile lines if necessary:<br />
## export KDIR=/usr/src/kernels/2.6.32-431.el6.x86_64/ export KVER=2.6.32-431.el6.x86_64<br />
<br />
# Build and install SRPT and SCST<br />
<br />
## make scst && make scst_install make srpt && make srpt_install<br />
## Then load the modules into the kernel, and set it up to start on boot.<br />
<br />
## /usr/lib/lsb/install_initd scst chkconfig --add scst<br />
## modprobe scst modprobe ib_srpt modprobe scst_vdisk<br />
<br />
=== Setting up the Devices ===<br />
<br />
NOTE: we ended up using HW RAID on the bottom, exported that via SRP, and ZFS mirrored at the top level. This very first step can be skipped<br />
<br />
If you are using ZFS, you need to create a logical volume. For example, let's say we have the pool shps-meta which is comprised of some disks.<br />
<br />
## zfs create -V 300G shps-meta/MDT zfs set canmount=off shps-meta <br />
<br />
### Now you have the device /dev/zvol/shps-meta/MDT<br />
<br />
<br />
<br />
On each system, once you have your LUN prepared (with the RAID controller or zpool) it's time to register that device:<br />
<br />
scstadmin -open_dev MDT1 -handler vdisk_blockio -attributes filename=/dev/zvol/shps-meta/MDT<br />
<br />
Then list the device and target:<br />
<br />
scstadmin -list_device scstadmin -list_target ls -l /sys/kernel/scst_tgt/devices<br />
<br />
You should get some info from each: the MDT1 dev you just created, and also ib_srpt_target0. If you don't get that, reload the ib_srpt module. <br />
<br />
Define a security group (the hosts that can write):<br />
<br />
scstadmin --add_group MDS -driver ib_srpt -target ib_srpt_target_0<br />
scstadmin -list_group<br />
<br />
Add initiators to the group: <br />
<br />
(for testing, leave this open)<br />
<br />
<br />
Assign the LUNs to the target <br />
scstadmin -add_lun 0 -driver ib_srpt -target ib_srpt_target_0 -group MDS -device MDT1<br />
<br />
Now enable the target:<br />
<br />
<br />
scstadmin -enable_target ib_srpt_target_0 -driver ib_srpt<br />
<br />
And enable the driver:<br />
scstadmin -set_drv_attr ib_srpt -attributes enabled=1<br />
<br />
Modprobe options to pass to the driver. This example configures one target per HCA port:<br />
<br />
cat /etc/modprobe.d/ib_srpt.conf<br />
options ib_srpt one_target_per_port=1<br />
<br />
Set up permissions for the LUN (necessary)<br />
<br />
scstadmin -add_init '*' -driver ib_srpt -target ib_srpt_target_0 -group MDS<br />
<br />
=== Initiator setup ===<br />
<br />
<br />
On the target, ensure that this initiator has permission to access the disk (see the security group setup above).<br />
<br />
<br />
First, load the module ib_srp:<br />
<br />
modprobe ib_srp<br />
<br />
Note: This module is part of OFED. OFED also includes the ib_srpt (target) module which is used to host the FS.<br />
<br />
<br />
<br />
Now, search for the available targets:<br />
<br />
srp_daemon -oacd /dev/infiniband/umad0<br />
Note: There could be multiple /dev/infiniband/umad devices. (umad bro?)<br />
<br />
<br />
Add the target(s):<br />
#scstadmin -add_target <br />
<br />
<br />
<br />
<br />
find /sys -iname add_target -print<br />
echo "id_ext=0002c90300b77f40,ioc_guid=0002c90300b77f40,dgid=fe800000000000000002c90300b77f41,pkey=ffff,service_id=0002c90300b77f40" > /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0/infiniband_srp/srp-mlx4_0-1/add_target<br />
<br />
Note: in the above example, the part echoed is the result of the previous srp_daemon command (there may be multiple devices to add this way), and the redirection is into the result of the find command.<br />
<br />
<br />
<br />
Be sure to write the config:<br />
<br />
<br />
scstadmin -write_config /etc/scst.conf<br />
<br />
And then ensure the startup script is in chkconfig:<br />
<br />
chkconfig --list scst<br />
chkconfig --list srpd<br />
chkconfig --list rdma<br />
/etc/rdma/rdma.conf must contain the line:<br />
<br />
SRP_LOAD=yes<br />
<br />
== Metadata Backup ==<br />
<br />
=== Snapshots ===<br />
<br />
This is our backup. We can just use ZFS features to keep the metadata reasonably in sync. This won't be perfectly up to date, sadly, but a sync from an hour or two ago means an hour or two of data may have been lost, which is very acceptable in many cases. <br />
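<br />
As a rough illustration, keeping that second copy with ZFS snapshots might look like the following. This is a minimal sketch, not our production procedure; the dataset name shps-meta/MDT comes from the example above, and the backup host and destination dataset names are made up.<br />
<pre><br />
#!/bin/bash<br />
# Snapshot the MDT dataset periodically and send the increment to another host.<br />
# Sketch only: 'backuphost' and the backup/MDT dataset are hypothetical names.<br />
POOL=shps-meta/MDT<br />
now=$(date +%Y%m%d%H%M)<br />
prev=$(zfs list -t snapshot -H -o name -s creation | grep "^${POOL}@" | tail -1)<br />
<br />
zfs snapshot ${POOL}@${now}<br />
if [ -n "$prev" ]; then<br />
    zfs send -i "$prev" ${POOL}@${now} | ssh backuphost zfs receive -F backup/MDT<br />
else<br />
    zfs send ${POOL}@${now} | ssh backuphost zfs receive backup/MDT<br />
fi<br />
</pre><br />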
<br />
We always backup the metadata also. This is necessary even if we have a mirrored MDT.</div>Sknolinhttp://wiki.lustre.org/index.php?title=MDT_Mirroring_with_ZFS_and_SRP&diff=561MDT Mirroring with ZFS and SRP2015-05-01T01:39:24Z<p>Sknolin: initial page</p>
<hr />
<div>Original information provided by Jesse Stroik, January 2014.<br />
<br />
At SSEC we tested mirrored metadata targets for our Lustre file systems. The idea is to use ZFS to mirror storage targets across two different MDS nodes, so the data is always available on both servers without using iSCSI or other technologies. The basis of this idea comes from Charles Taylor's LUG 2012 presentation "High Availability Lustre Using SRP-Mirrored LUNs"<br />
<br />
Instead of LVM, we will use ZFS to provide the mirror. SCST providing the targets over InfiniBand RDMA, combined with ZFS mirrors, performed well in our testing. We did not have a chance to test more thoroughly for production.<br />
<br />
Below are notes from our investigation.<br />
<br />
== Terminology ==<br />
* ''Target'' - The device to which data will be written. Usually it controls a group of LUNs (think OSS, not OST or individual disk).<br />
* Initiator - The system or device attempting to access the target. Client system in our case.<br />
== Protocols ==<br />
SRP - SCSI RDMA Protocol.<br />
<br />
Despite its name, this can be implemented without RDMA. As we would likely implement it, it is a protocol used to communicate with SCSI devices directly over RDMA.<br />
<br />
iSER - iSCSI Extensions for RDMA<br />
<br />
A layer of abstraction on the iSCSI protocol, implemented by a "Datamover Architecture" with RDMA support. The basic idea is simple: RDMA allows devices to reach each other's memory directly. When an initiator begins an unsolicited write, the disk uses the protocol to read the data from the initiator directly while writing to itself. So the target effectively goes and reads the data off of the initiator.<br />
<br />
== SRP Implementations ==<br />
# '''LIO''' - SRP implementation by Datera, a SV startup from 2011.<br />
<br />
It appears that Datera got this thing into the Linux kernel, but deployment and usage documentation is nonexistent or very hard to actually find.<br />
<br />
TargetCLI is a python CLI management interface for the targets. <br />
<br />
# '''SCST''' - Generic SCSI Target Subsystem (kernel modules - not in the official tree).<br />
<br />
This framework includes a few components:<br />
<br />
* Core/Engine software<br />
'''Target "Drivers"''' - I put drivers in quotes because this part is implemented as a kernel module and they call it a driver, but it is software that controls the Target (think OSS) and doesn't really provide a hardware driver as far as I understand.<br />
* '''Storage Drivers''' - This is the part that implements the SCSI commands on the LUN (in our case, attached OSTs).<br />
<br />
We will likely need to compile/link the target and storage drivers against a kernel version, and install only with that kernel version. We already link kernel versions with Lustre, so this may not be unreasonable.<br />
<br />
== iSER implementations ==<br />
<br />
* LIO - See information in SRP implementation. This also implements iSER.<br />
* STGT - SCSI Target Framework (userspace)<br />
** This doesn't perform as well as SCST according to the research. It's considered obsolete, but I included the definition because it may be mentioned in a lot of documentation.<br />
<br />
<br />
== Technologies Available w/ Summary ==<br />
<br />
# TGTD/iSER via scsi-target-utils<br />
# LIO<br />
# SCST <br />
# Snapshots <br />
<br />
=== LIO ===<br />
<br />
This seems like it has several disadvantages:<br />
<br />
* It appears not to be 'safe' for writes; you'd have to disable all caching, and they want you to use their OS.<br />
* As far as we could tell, it's effectively not open. We probably cannot implement it without their OS and their support.<br />
* It lags behind SCST.<br />
<br />
=== SCST ===<br />
<br />
We tested this. We had to grab their source, compile and link against our kernel, and install. This may be a temporary issue if we need to link against a newer kernel than Lustre is currently available against, but Lustre is getting accepted into the kernel as well, so we can plan for this to be our future technology if we cannot use it now.<br />
<br />
==== Installing SCST ====<br />
<br />
NOTE BEFORE IMPLEMENTING: THIS APPEARS TO USE 128KB INODE SIZES WHEN EXPORTED VIA ZPOOLS<br />
<br />
This requires the OFA OFED stack and links against it.<br />
<br />
# Download and install the SCST package and scstadmin: http://sourceforge.net/projects/scst/files/<br />
<br />
# Extract the SCST package, and verify your Makefile lines if necessary:<br />
## export KDIR=/usr/src/kernels/2.6.32-431.el6.x86_64/<br />
## export KVER=2.6.32-431.el6.x86_64<br />
<br />
# Build and install SRPT and SCST<br />
<br />
## make scst && make scst_install<br />
## make srpt && make srpt_install<br />
## Then load the modules into the kernel, and set it up to start on boot.<br />
<br />
## /usr/lib/lsb/install_initd scst<br />
## chkconfig --add scst<br />
## modprobe scst<br />
## modprobe ib_srpt<br />
## modprobe scst_vdisk<br />
<br />
==== Setting up the Devices ====<br />
<br />
NOTE: we ended up using HW RAID on the bottom, exported that via SRP, and ZFS mirrored at the top level. This very first step can be skipped<br />
<br />
If you are using ZFS, you need to create a logical volume. For example, let's say we have the pool shps-meta which is comprised of some disks.<br />
<br />
## zfs create -V 300G shps-meta/MDT<br />
## zfs set canmount=off shps-meta<br />
<br />
### Now you have the device /dev/zvol/shps-meta/MDT<br />
<br />
<br />
<br />
On each system, once you have your LUN prepared (with the RAID controller or zpool) it's time to register that device:<br />
<br />
scstadmin -open_dev MDT1 -handler vdisk_blockio -attributes filename=/dev/zvol/shps-meta/MDT<br />
<br />
Then list the device and target:<br />
<br />
scstadmin -list_device<br />
scstadmin -list_target<br />
ls -l /sys/kernel/scst_tgt/devices<br />
<br />
You should get some info from each: the MDT1 dev you just created, and also ib_srpt_target_0. If you don't get that, reload the ib_srpt module.<br />
<br />
Define a security group (the hosts that can write):<br />
<br />
scstadmin --add_group MDS -driver ib_srpt -target ib_srpt_target_0<br />
scstadmin -list_group<br />
<br />
Add initiators to the group: <br />
<br />
(for testing, leave this open)<br />
<br />
<br />
Assign the LUNs to the target <br />
scstadmin -add_lun 0 -driver ib_srpt -target ib_srpt_target_0 -group MDS -device MDT1<br />
<br />
Now enable the target:<br />
<br />
<br />
scstadmin -enable_target ib_srpt_target_0 -driver ib_srpt<br />
<br />
And enable the driver:<br />
scstadmin -set_drv_attr ib_srpt -attributes enabled=1<br />
<br />
Modprobe options to pass to the driver. This example configures one target per HCA port:<br />
<br />
cat /etc/modprobe.d/ib_srpt.conf<br />
options ib_srpt one_target_per_port=1<br />
<br />
Set up permissions for the LUN (necessary)<br />
<br />
scstadmin -add_init '*' -driver ib_srpt -target ib_srpt_target_0 -group MDS<br />
<br />
<br />
<br />
<br />
==== Initiator setup ====<br />
<br />
<br />
On the target, ensure that this initiator has permission to access the disk:<br />
<br />
<br />
<br />
First, load the module ib_srp:<br />
<br />
modprobe ib_srp<br />
<br />
Note: This module is part of OFED. OFED also includes the ib_srpt (target) module which is used to host the FS.<br />
<br />
<br />
<br />
Now, search for the available targets:<br />
<br />
srp_daemon -oacd /dev/infiniband/umad0<br />
Note: There could be multiple /dev/infiniband/umad devices. (umad bro?)<br />
<br />
<br />
Add the target(s):<br />
#scstadmin -add_target <br />
<br />
<br />
<br />
<br />
find /sys -iname add_target -print<br />
echo "id_ext=0002c90300b77f40,ioc_guid=0002c90300b77f40,dgid=fe800000000000000002c90300b77f41,pkey=ffff,service_id=0002c90300b77f40" > /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0/infiniband_srp/srp-mlx4_0-1/add_target<br />
<br />
Note: in the above example, the part echoed is the result of the previous srp_daemon command (there may be multiple devices to add this way), and the redirection is into the result of the find command.<br />
<br />
<br />
<br />
Be sure to write the config:<br />
<br />
<br />
scstadmin -write_config /etc/scst.conf<br />
<br />
And then ensure the startup script is in chkconfig:<br />
<br />
chkconfig --list scst<br />
chkconfig --list srpd<br />
chkconfig --list rdma<br />
/etc/rdma/rdma.conf must contain the line:<br />
<br />
SRP_LOAD=yes<br />
<br />
== Snapshots ==<br />
<br />
<br />
This is our backup. We can just use ZFS features to keep the metadata reasonably in sync. This won't be perfectly up to date, sadly, but a sync from an hour or two ago means an hour or two of data may have been lost, which is very acceptable in many cases.<br />
<br />
<br />
== Metadata Backups ==<br />
<br />
<br />
We always backup the metadata also. This is necessary even if we have a backup MDT.</div>Sknolinhttp://wiki.lustre.org/index.php?title=Lustre_Monitoring_and_Statistics_Guide&diff=543Lustre Monitoring and Statistics Guide2015-04-29T15:11:26Z<p>Sknolin: /* References and Links */</p>
<hr />
<div>== Introduction ==<br />
<br />
This guide is by Scott Nolin (scott.nolin@ssec.wisc.edu), of the University of Wisconsin Space Science and Engineering Center.<br />
<br />
There are a variety of useful statistics and counters available on Lustre servers and clients. This is an attempt to detail some of these statistics and methods for collecting and working with them.<br />
<br />
This does not include Lustre log analysis.<br />
<br />
The presumed audience for this is system administrators attempting to better understand and monitor their Lustre file systems.<br />
<br />
=== Adding to This Guide ===<br />
<br />
If you have improvements, corrections, or more information to share on this topic please contribute to this page. Ideally this would become a community resource.<br />
<br />
== Lustre Versions ==<br />
<br />
This information is based on working primarily with Lustre 2.4 and 2.5.<br />
<br />
== Reading /proc vs lctl ==<br />
<br />
'cat /proc/fs/lustre...' vs 'lctl get_param'<br />
<br />
With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats. This is to ensure portability. I will use this method in all examples; a bonus is that it often has slightly shorter syntax.<br />
<br />
== Data Formats ==<br />
Format of the various statistics type files varies (and I'm not sure if there is any reason for this). The format names here are entirely *my invention*, this isn't a standard for Lustre or anything.<br />
<br />
It is useful to know the various formats of these files so you can parse the data and collect for use in other tools. <br />
<br />
=== Stats ===<br />
<br />
What I consider "standard" stats files include, for example, each OST or MDT as a multi-line record, followed by the data.<br />
<br />
Example:<br />
<pre><br />
obdfilter.scratch-OST0001.stats=<br />
snapshot_time 1409777887.590578 secs.usecs<br />
read_bytes 27846475 samples [bytes] 4096 1048576 14421705314304<br />
write_bytes 16230483 samples [bytes] 1 1048576 14761109479164<br />
get_info 3735777 samples [reqs]<br />
</pre><br />
<br />
snapshot_time = when the stats were written<br />
<p><br />
For read_bytes and write_bytes:<br />
* First number = number of times (samples) the OST has handled a read or write. <br />
* Second number = the minimum read/write size<br />
* Third number = maximum read/write size<br />
* Fourth = sum of all the read/write requests in bytes, the quantity of data read/written.<br />
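<br />
To illustrate parsing this format, here is a small shell sketch (my own illustration, not an official tool) that pulls the sample count and byte sum for read_bytes and write_bytes out of every OST's stats:<br />
<pre><br />
# Print per-OST read/write totals from the 'stats' format.<br />
lctl get_param obdfilter.*.stats | awk '<br />
  /\.stats=$/ { tgt = $0; sub(/\.stats=$/, "", tgt) }<br />
  $1 == "read_bytes" || $1 == "write_bytes" {<br />
      # fields: name, count, "samples", [unit], min, max, sum<br />
      printf "%s %s samples=%s bytes=%s\n", tgt, $1, $2, $7<br />
  }'<br />
</pre><br />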
<br />
=== Jobstats ===<br />
<br />
Jobstats are slightly more complex multi-line records. They look like JSON, except for the (-) blocks for each job. Each OST or MDT also has an entry for each jobid (or procname_uid perhaps), and then the data. <br />
<br />
Example: <br />
<pre><br />
obdfilter.scratch-OST0000.job_stats=job_stats:<br />
- job_id: 56744<br />
snapshot_time: 1409778251<br />
read: { samples: 18722, unit: bytes, min: 4096, max: 1048576, sum: 17105657856 }<br />
write: { samples: 478, unit: bytes, min: 1238, max: 1048576, sum: 412545938 }<br />
setattr: { samples: 0, unit: reqs } punch: { samples: 95, unit: reqs }<br />
- job_id: . . . ETC<br />
</pre><br />
<br />
Notice this is very similar to 'stats' above.<br />
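<br />
For a quick look at per-job write volume without a YAML parser, something like this shell sketch works; it is only an illustration of the record layout, not a recommended collector:<br />
<pre><br />
# Print the write byte sum for each job_id on each OST.<br />
lctl get_param obdfilter.*.job_stats | awk '<br />
  /job_stats:/   { tgt = $0; sub(/\.job_stats=.*/, "", tgt) }<br />
  /job_id:/      { job = $NF }<br />
  /^[ -]*write:/ { gsub(/[,}]/, ""); printf "%s job=%s write_sum=%s\n", tgt, job, $NF }'<br />
</pre><br />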
<br />
=== Single ===<br />
<br />
These really boil down to just a single number in a file. But if you use "lctl get_param" you get an output that is nice for parsing. For example: <br />
<pre>[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail<br />
<br />
<br />
osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384<br />
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540<br />
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532<br />
</pre><br />
<br />
=== Histogram ===<br />
<br />
Some stats are histograms; these types aren't covered here. Typically they're useful on their own without further parsing(?)<br />
<br />
<br />
* brw_stats<br />
* extent_stats<br />
<br />
<br />
<br />
== Interesting Statistics Files ==<br />
<br />
This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. For example, you will notice these are mostly server stats. There is a wealth of client stats, too, not detailed here. Additions or corrections are welcome.<br />
<br />
* Host Type = MDS, OSS, client<br />
* Target = "lctl get_param target"<br />
* Format = data format discussed above<br />
<br />
{| class="wikitable"<br />
|-<br />
!Host Type !! Target !! Format !! Discussion<br />
|-<br />
| MDS || mdt.*MDT*.num_exports || single || number of exports per MDT - these are clients, including other lustre servers<br />
|-<br />
| MDS || mdt.*.job_stats || jobstats || Metadata jobstats. Note that with lustre DNE you may have more than one MDT, so even if you don't it may be wise to design any tools with that assumption.<br />
|-<br />
| OSS || obdfilter.*.job_stats || jobstats || the per OST jobstats. <br />
|-<br />
| MDS || mdt.*.md_stats || stats || Overall metadata stats per MDT<br />
|-<br />
| MDS || mdt.*MDT*.exports.*@*.stats || stats || Per-export metadata stats. The exports subdirectory lists client connections by NID. The exports are named by interfaces, which can be unwieldy. See "lltop" for an example of a script that used this data well. The sum of all the export stats should provide the same data as md_stats, but it is still very convenient to have md_stats; "ltop" uses them, for example.<br />
|-<br />
| OSS || obdfilter.*.stats || stats || Operations per OST. Read and write data is particularly interesting<br />
|-<br />
| OSS || obdfilter.*OST*.exports.*@*.stats || stats || per-export OSS statistics<br />
|-<br />
| MDS || osd-*.*MDT*.filesfree or filestotal || single || available or total inodes<br />
|-<br />
| MDS || osd-*.*MDT*.kbytesfree or kbytestotal || single || available or total disk space<br />
|-<br />
| OSS || obdfilter.*OST*.kbytesfree or kbytestotal, filesfree, filestotal || single || inodes and disk space as in MDS version<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.stats || stats? but unsure of all fields' meaning || lustre distributed lock manager (ldlm) stats. I do not fully understand these stats or the format. It also appears that these same stats are duplicated as individual 'single' files. My understanding of these stats comes from http://wiki.lustre.org/doxygen/HEAD/api/html/ldlm__pool_8c_source.html<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.lock_count || single || number of locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.granted || single || lustre distributed lock manager (ldlm) granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_plan || single || ldlm lock planned number of granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_rate || single || ldlm lock grant rate aka 'GR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.cancel_rate || single || ldlm lock cancel rate aka 'CR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_speed || single || ldlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR'. Or you can just get 'CR' from the stats file I assume.<br />
|}<br />
<br />
== Working With the Data ==<br />
<br />
Packages, tools, and techniques for working with Lustre statistics.<br />
<br />
=== Open Source Monitoring Packages ===<br />
<br />
*LMT - provides 'top' style monitoring of server nodes, and historical data via mysql. https://github.com/chaos/lmt<br />
*lltop and xltop - monitoring with batch scheduler integration. Newer Lustre versions with jobstats likely provide similar data very conveniently, but these are still very good for examples of working with monitoring data. https://github.com/jhammond/lltop https://github.com/jhammond/xltop<br />
<br />
<br />
=== Build it Yourself ===<br />
<br />
Here are basic steps and techniques for working with the Lustre statistics. <br />
<br />
# '''Gather''' the data on hosts you are monitoring. Deal with the syntax, extract what you want<br />
# '''Collect''' the data centrally - either pull or push it to your server, or collection of monitoring servers.<br />
# '''Process''' the data - this may be optional or minimal.<br />
# '''Alert''' on the data - optional but often useful.<br />
# '''Present''' the data - allow for visualization, analysis, etc.<br />
<br />
Some recent tools for working with metrics and time series data have made some of the more difficult parts of this task relatively easy, especially graphical presentation.<br />
<br />
Here are details of some solutions tested or in use:<br />
<br />
==== Collectl and Ganglia ====<br />
<br />
Collectl supports Lustre stats. Note that there have recently been some changes; Lustre support in collectl is moving to plugins: http://sourceforge.net/p/collectl/mailman/message/31992463 https://github.com/pcpiela/collectl-lustre<br />
<br />
This process is not based on the new versions, but they should work similarly. <br />
<br />
# collectl - does the '''gather''' by writing to a text file on the host being monitored<br />
# ganglia does the '''collect''' via gmond and python script 'collectl.py' and '''present''' via ganglia web pages - there is no alerting.<br />
<br />
See https://wiki.rocksclusters.org/wiki/index.php/Roy_Dragseth#Integrating_collectl_and_ganglia<br />
<br />
<br />
==== Perl and Graphite ====<br />
<br />
<br />
Graphite is a very convenient tool for storing, working with, and rendering graphs of time-series data. At SSEC we did a quick prototype for collecting and sending MDS and OSS data using perl. The choice of perl is not particularly important, python or the tool of your choice is fine.<br />
<br />
Software Used:<br />
* Graphite and Carbon - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* lustrestats scripts - these are simply run every minute via cron on the servers you monitor. For the SSEC prototype we simply sent text data via a TCP socket. The check_mk scripts in the next section have replaced these original test scripts.<br />
* Grafana - http://grafana.org - this is a dashboard and graph editor for graphite. It is not required, as graphite can be used directly, but is very convenient. It allows not only easy creation of dashboards, but also encourages rapid interactive analysis of the data. Note that elasticsearch can be used to store dashboards for grafana, but is not required.<br />
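<br />
For reference, Carbon's plaintext protocol is just ''metric-path value timestamp'' lines sent to TCP port 2003 by default. A minimal sketch of pushing one counter this way - the Graphite host name and metric path below are made-up examples:<br />
<pre><br />
# Send the write_bytes sum for one OST to Carbon's plaintext listener.<br />
CARBON=graphite.example.com   # hypothetical Graphite/Carbon host<br />
VALUE=$(lctl get_param -n obdfilter.scratch-OST0001.stats | awk '$1 == "write_bytes" {print $7}')<br />
echo "lustre.oss1.scratch-OST0001.write_bytes_sum ${VALUE} $(date +%s)" | nc ${CARBON} 2003<br />
</pre><br />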
<br />
==== check_mk and Graphite ====<br />
<br />
Another option is, instead of sending directly with perl, to use a check_mk local agent check.<br />
<br />
The local agent and pnp4nagios mean a reasonable infrastructure is already in place for alerting and also collecting performance data.<br />
<br />
Collecting via perl allowed us to send the timestamp from the Lustre stats (when they exist) directly to Carbon, Graphite's data collection tool. When using the check_mk method this timestamp is lost, so timestamps are based on when the local agent check runs. This will introduce some inaccuracy - a delay of up to your sample interval.<br />
<br />
Collecting via both methods allows you to see this difference. This graph shows all the "export" stats summed for each method, with derivative applied to create a rate of change. "CMK" is the check_mk data and "timestamped" was from the perl script. Plotting the raw counter data of course shows very little, but with this derived data you can see the difference.<br />
<br />
This data was sampled once per minute: <br />
<br />
[[File:Timestamp_graphite_jitter.PNG|400px]]<br />
<br />
For our uses at SSEC, this was acceptable. Sampling much more frequently will of course make the error smaller.<br />
<br />
<br />
* Graphite - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* OMD - check_mk, nagios, pnp4nagios<br />
* check_mk local scripts - these are called via check_mk, at whatever rate is desired. http://www.ssec.wisc.edu/~scottn/files/lustre_stats_mds.cmk http://www.ssec.wisc.edu/~scottn/files/lustre_stats_oss.cmk<br />
* graphios https://github.com/shawn-sterling/graphios - a python script to send your nagios performance data to graphite<br />
* Grafana - http://grafana.org - not required, but convenient for dashboards.<br />
<br />
'''Grafana Lustre Dashboard Screenshots:'''<br />
<br />
[[File:Meta-oveview.PNG|200px|Metadata for multiple file systems.]] [[File:Fs-dashboard.PNG|200px|Dashboard for a lustre file system.]]<br />
<br />
==== Logstash, python, and Graphite ====<br />
<br />
Brock Palen discusses this method: http://www.failureasaservice.com/2014/10/lustre-stats-with-graphite-and-logstash.html<br />
<br />
==== Collectd plugin and Graphite ====<br />
<br />
This talk mentions a custom collectd plugin to send stats to graphite:<br />
http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
<br />
Unsure if the source for that plugin is available.<br />
<br />
==== A Note about Jobstats ====<br />
<br />
If using a whisper or RRD-file based solution, jobstats may not be a great fit. The strength of RRD or Whisper files is that you have a set size for each metric collected. If your metrics are now per-job as opposed to only per-export or per-server, this means your ''number of metrics'' is now growing without bound.<br />
<br />
Solutions anyone?<br />
<br />
<br />
== References and Links ==<br />
<br />
<br />
* http://cdn.opensfs.org/wp-content/uploads/2015/04/Lustre-Metrics-New-Techniques-for-Monitoring_Nolin_Wagner.pdf<br />
* Daniel Kobras, "Lustre - Finding the Lustre Filesystem Bottleneck", LAD2012. http://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf<br />
* Florent Thery, "Centralized Lustre Monitoring on Bull Platforms", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/11_Florent_Thery_LAD2013-lustre-bull-monitoring.pdf<br />
* Daniel Rodwell and Patrick Fitzhenry, "Fine-Grained File System Monitoring with Lustre Jobstat", LUG2014. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
* Gabriele Paciucci and Andrew Uselton, "Monitoring the Lustre* file system to maintain optimal performance", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/15_Gabriele_Paciucci_LAD13_Monitoring_05.pdf<br />
* Christopher Morrone, "LMT Lustre Monitoring Tools", LUG2011. http://cdn.opensfs.org/wp-content/uploads/2012/12/400-430_Chris_Morrone_LMT_v2.pdf<br />
<br />
<br />
* https://github.com/jhammond/lltop<br />
* https://github.com/chaos/lmt<br />
* https://github.com/chaos/cerebro<br />
* http://graphite.readthedocs.org/en/latest/<br />
* https://mathias-kettner.de/check_mk<br />
* https://github.com/shawn-sterling/graphios<br />
<br />
[[Category: Monitoring|!]]<br />
[[Category: Statistics]]<br />
[[Category: Metrics]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=Lustre_Monitoring_and_Statistics_Guide&diff=542Lustre Monitoring and Statistics Guide2015-04-29T15:05:53Z<p>Sknolin: /* check_mk and Graphite */</p>
<hr />
<div>== Introduction ==<br />
<br />
This guide is by Scott Nolin (scott.nolin@ssec.wisc.edu), of the University of Wisconsin Space Science and Engineering Center.<br />
<br />
There are a variety of useful statistics and counters available on Lustre servers and clients. This is an attempt to detail some of these statistics and methods for collecting and working with them.<br />
<br />
This does not include Lustre log analysis.<br />
<br />
The presumed audience for this is system administrators attempting to better understand and monitor their Lustre file systems.<br />
<br />
=== Adding to This Guide ===<br />
<br />
If you have improvements, corrections, or more information to share on this topic please contribute to this page. Ideally this would become a community resource.<br />
<br />
== Lustre Versions ==<br />
<br />
This information is based on working primarily with Lustre 2.4 and 2.5.<br />
<br />
== Reading /proc vs lctl ==<br />
<br />
'cat /proc/fs/lustre...' vs 'lctl get_param'<br />
<br />
With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats. This is to ensure portability. I will use this method in all examples; a bonus is that it often has slightly shorter syntax.<br />
<br />
== Data Formats ==<br />
Format of the various statistics type files varies (and I'm not sure if there is any reason for this). The format names here are entirely *my invention*, this isn't a standard for Lustre or anything.<br />
<br />
It is useful to know the various formats of these files so you can parse the data and collect for use in other tools. <br />
<br />
=== Stats ===<br />
<br />
What I consider "standard" stats files include, for example, each OST or MDT as a multi-line record, followed by the data.<br />
<br />
Example:<br />
<pre><br />
obdfilter.scratch-OST0001.stats=<br />
snapshot_time 1409777887.590578 secs.usecs<br />
read_bytes 27846475 samples [bytes] 4096 1048576 14421705314304<br />
write_bytes 16230483 samples [bytes] 1 1048576 14761109479164<br />
get_info 3735777 samples [reqs]<br />
</pre><br />
<br />
snapshot_time = when the stats were written<br />
<p><br />
For read_bytes and write_bytes:<br />
* First number = number of times (samples) the OST has handled a read or write. <br />
* Second number = the minimum read/write size<br />
* Third number = maximum read/write size<br />
* Fourth = sum of all the read/write requests in bytes, the quantity of data read/written.<br />
<br />
=== Jobstats ===<br />
<br />
Jobstats are slightly more complex multi-line records. They look like JSON, except for the (-) blocks for each job. Each OST or MDT also has an entry for each jobid (or procname_uid perhaps), and then the data. <br />
<br />
Example: <br />
<pre><br />
obdfilter.scratch-OST0000.job_stats=job_stats:<br />
- job_id: 56744<br />
snapshot_time: 1409778251<br />
read: { samples: 18722, unit: bytes, min: 4096, max: 1048576, sum: 17105657856 }<br />
write: { samples: 478, unit: bytes, min: 1238, max: 1048576, sum: 412545938 }<br />
setattr: { samples: 0, unit: reqs } punch: { samples: 95, unit: reqs }<br />
- job_id: . . . ETC<br />
</pre><br />
<br />
Notice this is very similar to 'stats' above.<br />
<br />
=== Single ===<br />
<br />
These really boil down to just a single number in a file. But if you use "lctl get_param" you get an output that is nice for parsing. For example: <br />
<pre>[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail<br />
<br />
<br />
osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384<br />
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540<br />
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532<br />
</pre><br />
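<br />
Since the output is already name=value, splitting it for a collector is trivial; a tiny sketch:<br />
<pre><br />
# Turn name=value output into "metric value" pairs.<br />
lctl get_param osd-ldiskfs.*OST*.kbytesavail | awk -F= '{ print $1, $2 }'<br />
</pre><br />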
<br />
=== Histogram ===<br />
<br />
Some stats are histograms; these types aren't covered here. Typically they're useful on their own without further parsing(?)<br />
<br />
<br />
* brw_stats<br />
* extent_stats<br />
<br />
<br />
<br />
== Interesting Statistics Files ==<br />
<br />
This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. For example, you will notice these are mostly server stats. There is a wealth of client stats, too, not detailed here. Additions or corrections are welcome.<br />
<br />
* Host Type = MDS, OSS, client<br />
* Target = "lctl get_param target"<br />
* Format = data format discussed above<br />
<br />
{| class="wikitable"<br />
|-<br />
!Host Type !! Target !! Format !! Discussion<br />
|-<br />
| MDS || mdt.*MDT*.num_exports || single || number of exports per MDT - these are clients, including other lustre servers<br />
|-<br />
| MDS || mdt.*.job_stats || jobstats || Metadata jobstats. Note that with lustre DNE you may have more than one MDT, so even if you don't it may be wise to design any tools with that assumption.<br />
|-<br />
| OSS || obdfilter.*.job_stats || jobstats || the per OST jobstats. <br />
|-<br />
| MDS || mdt.*.md_stats || stats || Overall metadata stats per MDT<br />
|-<br />
| MDS || mdt.*MDT*.exports.*@*.stats || stats || Per-export metadata stats. The exports subdirectory lists client connections by NID. The exports are named by interfaces, which can be unwieldy. See "lltop" for an example of a script that used this data well. The sum of all the export stats should provide the same data as md_stats, but it is still very convenient to have md_stats; "ltop" uses them, for example.<br />
|-<br />
| OSS || obdfilter.*.stats || stats || Operations per OST. Read and write data is particularly interesting<br />
|-<br />
| OSS || obdfilter.*OST*.exports.*@*.stats || stats || per-export OSS statistics<br />
|-<br />
| MDS || osd-*.*MDT*.filesfree or filestotal || single || available or total inodes<br />
|-<br />
| MDS || osd-*.*MDT*.kbytesfree or kbytestotal || single || available or total disk space<br />
|-<br />
| OSS || obdfilter.*OST*.kbytesfree or kbytestotal, filesfree, filestotal || single || inodes and disk space as in MDS version<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.stats || stats? but unsure of all fields' meaning || lustre distributed lock manager (ldlm) stats. I do not fully understand these stats or the format. It also appears that these same stats are duplicated as individual 'single' files. My understanding of these stats comes from http://wiki.lustre.org/doxygen/HEAD/api/html/ldlm__pool_8c_source.html<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.lock_count || single || number of locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.granted || single || lustre distributed lock manager (ldlm) granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_plan || single || ldlm lock planned number of granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_rate || single || ldlm lock grant rate aka 'GR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.cancel_rate || single || ldlm lock cancel rate aka 'CR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_speed || single || ldlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR'. Or you can just get 'CR' from the stats file I assume.<br />
|}<br />
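<br />
As an illustration of pulling several of the counters from this table in one pass (the parameter list is only an example; adjust it to your needs):<br />
<pre><br />
# Dump a few interesting single-value parameters on an OSS.<br />
lctl get_param "obdfilter.*OST*.kbytesfree" "obdfilter.*OST*.filesfree" \<br />
               "ldlm.namespaces.filter-*.lock_count" "ldlm.namespaces.filter-*.pool.grant_rate"<br />
</pre><br />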
<br />
== Working With the Data ==<br />
<br />
Packages, tools, and techniques for working with Lustre statistics.<br />
<br />
=== Open Source Monitoring Packages ===<br />
<br />
*LMT - provides 'top' style monitoring of server nodes, and historical data via mysql. https://github.com/chaos/lmt<br />
*lltop and xltop - monitoring with batch scheduler integration. Newer Lustre versions with jobstats likely provide similar data very conveniently, but these are still very good for examples of working with monitoring data. https://github.com/jhammond/lltop https://github.com/jhammond/xltop<br />
<br />
<br />
=== Build it Yourself ===<br />
<br />
Here are basic steps and techniques for working with the Lustre statistics. <br />
<br />
# '''Gather''' the data on hosts you are monitoring. Deal with the syntax, extract what you want<br />
# '''Collect''' the data centrally - either pull or push it to your server, or collection of monitoring servers.<br />
# '''Process''' the data - this may be optional or minimal.<br />
# '''Alert''' on the data - optional but often useful.<br />
# '''Present''' the data - allow for visualization, analysis, etc.<br />
<br />
Some recent tools for working with metrics and time series data have made some of the more difficult parts of this task relatively easy, especially graphical presentation.<br />
<br />
Here are details of some solutions tested or in use:<br />
<br />
==== Collectl and Ganglia ====<br />
<br />
Collectl supports Lustre stats. Note that there have recently been some changes; Lustre support in collectl is moving to plugins: http://sourceforge.net/p/collectl/mailman/message/31992463 https://github.com/pcpiela/collectl-lustre<br />
<br />
This process is not based on the new versions, but they should work similarly. <br />
<br />
# collectl - does the '''gather''' by writing to a text file on the host being monitored<br />
# ganglia does the '''collect''' via gmond and python script 'collectl.py' and '''present''' via ganglia web pages - there is no alerting.<br />
<br />
See https://wiki.rocksclusters.org/wiki/index.php/Roy_Dragseth#Integrating_collectl_and_ganglia<br />
<br />
<br />
==== Perl and Graphite ====<br />
<br />
<br />
Graphite is a very convenient tool for storing, working with, and rendering graphs of time-series data. At SSEC we did a quick prototype for collecting and sending MDS and OSS data using perl. The choice of perl is not particularly important, python or the tool of your choice is fine.<br />
<br />
Software Used:<br />
* Graphite and Carbon - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* lustrestats scripts - these are simply run every minute via cron on the servers you monitor. For the SSEC prototype we simply sent text data via a TCP socket. The check_mk scripts in the next section have replaced these original test scripts.<br />
* Grafana - http://grafana.org - this is a dashboard and graph editor for graphite. It is not required, as graphite can be used directly, but is very convenient. It allows not only easy creation of dashboards, but also encourages rapid interactive analysis of the data. Note that elasticsearch can be used to store dashboards for grafana, but is not required.<br />
<br />
==== check_mk and Graphite ====<br />
<br />
Another option is, instead of sending directly with perl, to use a check_mk local agent check.<br />
<br />
The local agent and pnp4nagios mean a reasonable infrastructure is already in place for alerting and also collecting performance data.<br />
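<br />
A check_mk local check is simply an executable dropped in the agent's local directory that prints one "status service-name perfdata text" line per service. The scripts linked below are the real implementation; this is only a stripped-down sketch of the idea:<br />
<pre><br />
#!/bin/bash<br />
# Minimal check_mk local check: report free space per OST as perfdata.<br />
# Status 0 = OK; a real check would add warn/crit thresholds.<br />
lctl get_param obdfilter.*OST*.kbytesfree | while IFS='=' read -r name value; do<br />
    ost=$(echo "$name" | cut -d. -f2)<br />
    echo "0 Lustre_${ost}_space kbytesfree=${value} OK - ${value} kB free on ${ost}"<br />
done<br />
</pre><br />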
<br />
Collecting via perl allowed us to send the timestamp from the Lustre stats (when they exist) directly to Carbon, Graphite's data collection tool. When using the check_mk method this timestamp is lost, so timestamps are based on when the local agent check runs. This will introduce some inaccuracy - a delay of up to your sample interval.<br />
<br />
Collecting via both methods allows you to see this difference. This graph shows all the "export" stats summed for each method, with derivative applied to create a rate of change. "CMK" is the check_mk data and "timestamped" was from the perl script. Plotting the raw counter data of course shows very little, but with this derived data you can see the difference.<br />
<br />
This data was sampled once per minute: <br />
<br />
[[File:Timestamp_graphite_jitter.PNG|400px]]<br />
<br />
For our uses at SSEC, this was acceptable. Sampling much more frequently will of course make the error smaller.<br />
<br />
<br />
* Graphite - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* OMD - check_mk, nagios, pnp4nagios<br />
* check_mk local scripts - these are called via check_mk, at whatever rate is desired. http://www.ssec.wisc.edu/~scottn/files/lustre_stats_mds.cmk http://www.ssec.wisc.edu/~scottn/files/lustre_stats_oss.cmk<br />
* graphios https://github.com/shawn-sterling/graphios - a python script to send your nagios performance data to graphite<br />
* Grafana - http://grafana.org - not required, but convenient for dashboards.<br />
<br />
'''Grafana Lustre Dashboard Screenshots:'''<br />
<br />
[[File:Meta-oveview.PNG|200px|Metadata for multiple file systems.]] [[File:Fs-dashboard.PNG|200px|Dashboard for a lustre file system.]]<br />
<br />
==== Logstash, python, and Graphite ====<br />
<br />
Brock Palen discusses this method: http://www.failureasaservice.com/2014/10/lustre-stats-with-graphite-and-logstash.html<br />
<br />
==== Collectd plugin and Graphite ====<br />
<br />
This talk mentions a custom collectd plugin to send stats to graphite:<br />
http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
<br />
Unsure if the source for that plugin is available.<br />
<br />
==== A Note about Jobstats ====<br />
<br />
If using a whisper or RRD-file based solution, jobstats may not be a great fit. The strength of RRD or Whisper files is that you have a set size for each metric collected. If your metrics are now per-job as opposed to only per-export or per-server, this means your ''number of metrics'' is now growing without bound.<br />
<br />
Solutions anyone?<br />
<br />
<br />
== References and Links ==<br />
<br />
<br />
* Daniel Kobras, "Lustre - Finding the Lustre Filesystem Bottleneck", LAD2012. http://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf<br />
* Florent Thery, "Centralized Lustre Monitoring on Bull Platforms", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/11_Florent_Thery_LAD2013-lustre-bull-monitoring.pdf<br />
* Daniel Rodwell and Patrick Fitzhenry, "Fine-Grained File System Monitoring with Lustre Jobstat", LUG2014. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
* Gabriele Paciucci and Andrew Uselton, "Monitoring the Lustre* file system to maintain optimal performance", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/15_Gabriele_Paciucci_LAD13_Monitoring_05.pdf<br />
* Christopher Morrone, "LMT Lustre Monitoring Tools", LUG2011. http://cdn.opensfs.org/wp-content/uploads/2012/12/400-430_Chris_Morrone_LMT_v2.pdf<br />
<br />
* https://github.com/jhammond/lltop<br />
* https://github.com/chaos/lmt<br />
* https://github.com/chaos/cerebro<br />
* http://graphite.readthedocs.org/en/latest/<br />
* https://mathias-kettner.de/check_mk<br />
* https://github.com/shawn-sterling/graphios<br />
<br />
[[Category: Monitoring|!]]<br />
[[Category: Statistics]]<br />
[[Category: Metrics]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=File:Fs-dashboard.PNG&diff=541File:Fs-dashboard.PNG2015-04-29T15:04:25Z<p>Sknolin: grafana dashboard for lustre metrics</p>
<hr />
<div>grafana dashboard for lustre metrics</div>Sknolinhttp://wiki.lustre.org/index.php?title=Lustre_Monitoring_and_Statistics_Guide&diff=540Lustre Monitoring and Statistics Guide2015-04-29T15:03:44Z<p>Sknolin: /* check_mk and Graphite */</p>
<hr />
<div>== Introduction ==<br />
<br />
This guide is by Scott Nolin (scott.nolin@ssec.wisc.edu), of the University of Wisconsin Space Science and Engineering Center.<br />
<br />
There are a variety of useful statistics and counters available on Lustre servers and clients. This is an attempt to detail some of these statistics and methods for collecting and working with them.<br />
<br />
This does not include Lustre log analysis.<br />
<br />
The presumed audience for this is system administrators attempting to better understand and monitor their Lustre file systems.<br />
<br />
=== Adding to This Guide ===<br />
<br />
If you have improvements, corrections, or more information to share on this topic please contribute to this page. Ideally this would become a community resource.<br />
<br />
== Lustre Versions ==<br />
<br />
This information is based on working primarily with Lustre 2.4 and 2.5.<br />
<br />
== Reading /proc vs lctl ==<br />
<br />
'cat /proc/fs/lustre...' vs 'lctl get_param'<br />
<br />
With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats. This is to ensure portability. I will use this method in all examples; a bonus is that it often has slightly shorter syntax.<br />
<br />
== Data Formats ==<br />
Format of the various statistics type files varies (and I'm not sure if there is any reason for this). The format names here are entirely *my invention*, this isn't a standard for Lustre or anything.<br />
<br />
It is useful to know the various formats of these files so you can parse the data and collect for use in other tools. <br />
<br />
=== Stats ===<br />
<br />
What I consider "standard" stats files include, for example, each OST or MDT as a multi-line record, followed by the data.<br />
<br />
Example:<br />
<pre><br />
obdfilter.scratch-OST0001.stats=<br />
snapshot_time 1409777887.590578 secs.usecs<br />
read_bytes 27846475 samples [bytes] 4096 1048576 14421705314304<br />
write_bytes 16230483 samples [bytes] 1 1048576 14761109479164<br />
get_info 3735777 samples [reqs]<br />
</pre><br />
<br />
snapshot_time = when the stats were written<br />
<p><br />
For read_bytes and write_bytes:<br />
* First number = number of times (samples) the OST has handled a read or write. <br />
* Second number = the minimum read/write size<br />
* Third number = maximum read/write size<br />
* Fourth = sum of all the read/write requests in bytes, the quantity of data read/written.<br />
<br />
=== Jobstats ===<br />
<br />
Jobstats are slightly more complex multi-line records. They look like JSON, except for the (-) blocks for each job. Each OST or MDT also has an entry for each jobid (or procname_uid perhaps), and then the data. <br />
<br />
Example: <br />
<pre><br />
obdfilter.scratch-OST0000.job_stats=job_stats:<br />
- job_id: 56744<br />
snapshot_time: 1409778251<br />
read: { samples: 18722, unit: bytes, min: 4096, max: 1048576, sum: 17105657856 }<br />
write: { samples: 478, unit: bytes, min: 1238, max: 1048576, sum: 412545938 }<br />
setattr: { samples: 0, unit: reqs } punch: { samples: 95, unit: reqs }<br />
- job_id: . . . ETC<br />
</pre><br />
<br />
Notice this is very similar to 'stats' above.<br />
<br />
=== Single ===<br />
<br />
These really boil down to just a single number in a file. But if you use "lctl get_param" you get an output that is nice for parsing. For example: <br />
<pre>[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail<br />
<br />
<br />
osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384<br />
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540<br />
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532<br />
</pre><br />
<br />
=== Histogram ===<br />
<br />
Some stats are histograms; these types aren't covered here. Typically they're useful on their own without further parsing(?)<br />
<br />
<br />
* brw_stats<br />
* extent_stats<br />
<br />
<br />
<br />
== Interesting Statistics Files ==<br />
<br />
This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. For example, you will notice these are mostly server stats. There is a wealth of client stats, too, not detailed here. Additions or corrections are welcome.<br />
<br />
* Host Type = MDS, OSS, client<br />
* Target = "lctl get_param target"<br />
* Format = data format discussed above<br />
<br />
{| class="wikitable"<br />
|-<br />
!Host Type !! Target !! Format !! Discussion<br />
|-<br />
| MDS || mdt.*MDT*.num_exports || single || number of exports per MDT - these are clients, including other lustre servers<br />
|-<br />
| MDS || mdt.*.job_stats || jobstats || Metadata jobstats. Note that with lustre DNE you may have more than one MDT, so even if you don't it may be wise to design any tools with that assumption.<br />
|-<br />
| OSS || obdfilter.*.job_stats || jobstats || the per OST jobstats. <br />
|-<br />
| MDS || mdt.*.md_stats || stats || Overall metadata stats per MDT<br />
|-<br />
| MDS || mdt.*MDT*.exports.*@*.stats || stats || Per-export metadata stats. The exports subdirectory lists client connections by NID. The exports are named by interfaces, which can be unwieldy. See "lltop" for an example of a script that used this data well. The sum of all the export stats should provide the same data as md_stats, but it is still very convenient to have md_stats; "ltop" uses them, for example.<br />
|-<br />
| OSS || obdfilter.*.stats || stats || Operations per OST. Read and write data is particularly interesting<br />
|-<br />
| OSS || obdfilter.*OST*.exports.*@*.stats || stats || per-export OSS statistics<br />
|-<br />
| MDS || osd-*.*MDT*.filesfree or filestotal || single || available or total inodes<br />
|-<br />
| MDS || osd-*.*MDT*.kbytesfree or kbytestotal || single || available or total disk space<br />
|-<br />
| OSS || obdfilter.*OST*.kbytesfree or kbytestotal, filesfree, filestotal || single || inodes and disk space as in MDS version<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.stats || stats? but unsure of all fields' meaning || lustre distributed lock manager (ldlm) stats. I do not fully understand these stats or the format. It also appears that these same stats are duplicated as individual 'single' files. My understanding of these stats comes from http://wiki.lustre.org/doxygen/HEAD/api/html/ldlm__pool_8c_source.html<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.lock_count || single || number of locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.granted || single || lustre distributed lock manager (ldlm) granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_plan || single || ldlm lock planned number of granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_rate || single || ldlm lock grant rate aka 'GR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.cancel_rate || single || ldlm lock cancel rate aka 'CR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_speed || single || ldlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR' (see the sketch after this table), or presumably just get 'CR' from the stats file.<br />
|}<br />
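<br />
Since grant_speed = grant_rate - cancel_rate, the cancel rate can be recovered from the two 'single' files; a quick sketch, one value per lock namespace:<br />
<pre><br />
# Derive the ldlm cancel rate (CR) from grant_rate and grant_speed.<br />
for ns in $(lctl list_param "ldlm.namespaces.filter-*"); do<br />
    gr=$(lctl get_param -n ${ns}.pool.grant_rate)<br />
    gs=$(lctl get_param -n ${ns}.pool.grant_speed)<br />
    echo "${ns} cancel_rate=$((gr - gs))"<br />
done<br />
</pre><br />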
<br />
== Working With the Data ==<br />
<br />
Packages, tools, and techniques for working with Lustre statistics.<br />
<br />
=== Open Source Monitoring Packages ===<br />
<br />
*LMT - provides 'top' style monitoring of server nodes, and historical data via mysql. https://github.com/chaos/lmt<br />
*lltop and xltop - monitoring with batch scheduler integration. Newer Lustre versions with jobstats likely provide similar data very conveniently, but these are still very good for examples of working with monitoring data. https://github.com/jhammond/lltop https://github.com/jhammond/xltop<br />
<br />
<br />
=== Build it Yourself ===<br />
<br />
Here are basic steps and techniques for working with the Lustre statistics. <br />
<br />
# '''Gather''' the data on hosts you are monitoring. Deal with the syntax, extract what you want<br />
# '''Collect''' the data centrally - either pull or push it to your server, or collection of monitoring servers.<br />
# '''Process''' the data - this may be optional or minimal.<br />
# '''Alert''' on the data - optional but often useful.<br />
# '''Present''' the data - allow for visualization, analysis, etc.<br />
<br />
Some recent tools for working with metrics and time series data have made some of the more difficult parts of this task relatively easy, especially graphical presentation.<br />
<br />
Here are details of some solutions tested or in use:<br />
<br />
==== Collectl and Ganglia ====<br />
<br />
Collectl supports Lustre stats. Note that there have recently been some changes; Lustre support in collectl is moving to plugins: http://sourceforge.net/p/collectl/mailman/message/31992463 https://github.com/pcpiela/collectl-lustre<br />
<br />
This process is not based on the new versions, but they should work similarly. <br />
<br />
# collectl - does the '''gather''' by writing to a text file on the host being monitored<br />
# ganglia does the '''collect''' via gmond and python script 'collectl.py' and '''present''' via ganglia web pages - there is no alerting.<br />
<br />
See https://wiki.rocksclusters.org/wiki/index.php/Roy_Dragseth#Integrating_collectl_and_ganglia<br />
<br />
<br />
==== Perl and Graphite ====<br />
<br />
<br />
Graphite is a very convenient tool for storing, working with, and rendering graphs of time-series data. At SSEC we did a quick prototype for collecting and sending MDS and OSS data using perl. The choice of perl is not particularly important, python or the tool of your choice is fine.<br />
<br />
Software Used:<br />
* Graphite and Carbon - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* lustrestats scripts - these are simply run every minute via cron on the servers you monitor. For the SSEC prototype we simply sent text data via a TCP socket. The check_mk scripts in the next section have replaced these original test scripts.<br />
* Grafana - http://grafana.org - this is a dashboard and graph editor for graphite. It is not required, as graphite can be used directly, but is very convenient. It allows not only easy creation of dashboards, but also encourages rapid interactive analysis of the data. Note that elasticsearch can be used to store dashboards for grafana, but is not required.<br />
<br />
==== check_mk and Graphite ====<br />
<br />
Another option is, instead of sending directly with perl, to use a check_mk local agent check.<br />
<br />
The local agent and pnp4nagios mean a reasonable infrastructure is already in place for alerting and also collecting performance data.<br />
<br />
Collecting via perl allowed us to send the timestamp from the Lustre stats (when they exist) directly to Carbon, Graphite's data collection tool. When using the check_mk method this timestamp is lost, so timestamps are based on when the local agent check runs. This will introduce some inaccuracy - a delay of up to your sample interval.<br />
<br />
Collecting via both methods allows you to see this difference. This graph shows all the "export" stats summed for each method, with derivative applied to create a rate of change. "CMK" is the check_mk data and "timestamped" was from the perl script. Plotting the raw counter data of course shows very little, but with this derived data you can see the difference.<br />
<br />
This data was sampled once per minute: <br />
<br />
[[File:Timestamp_graphite_jitter.PNG|400px]]<br />
<br />
For our uses at SSEC, this was acceptable. Sampling much more frequently will of course make the error smaller.<br />
<br />
<br />
* Graphite - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* OMD - check_mk, nagios, pnp4nagios<br />
* check_mk local scripts - these are called via check_mk, at whatever rate is desired. http://www.ssec.wisc.edu/~scottn/files/lustre_stats_mds.cmk http://www.ssec.wisc.edu/~scottn/files/lustre_stats_oss.cmk<br />
* graphios https://github.com/shawn-sterling/graphios - a python script to send your nagios performance data to graphite<br />
* Grafana - http://grafana.org - not required, but convenient for dashboards.<br />
<br />
'''Grafana Lustre Dashboard Screenshots:'''<br />
<br />
[[File:Meta-oveview.PNG|200px|Metadata for multiple file systems.]] [[File:Fs-dashboard.png|200px|Dashboard for a lustre file system.]]<br />
<br />
==== Logstash, python, and Graphite ====<br />
<br />
Brock Palen discusses this method: http://www.failureasaservice.com/2014/10/lustre-stats-with-graphite-and-logstash.html<br />
<br />
==== Collectd plugin and Graphite ====<br />
<br />
This talk mentions a custom collectd plugin to send stats to graphite:<br />
http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
<br />
Unsure if the source for that plugin is available.<br />
<br />
==== A Note about Jobstats ====<br />
<br />
If using a whisper or RRD-file based solution, jobstats may not be a great fit. The strength of RRD or Whisper files is that you have a set size for each metric collected. If your metrics are now per-job as opposed to only per-export or per-server, this means your ''number of metrics'' is now growing without bound.<br />
<br />
Solutions anyone?<br />
<br />
<br />
== References and Links ==<br />
<br />
<br />
* Daniel Kobras, "Lustre - Finding the Lustre Filesystem Bottleneck", LAD2012. http://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf<br />
* Florent Thery, "Centralized Lustre Monitoring on Bull Platforms", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/11_Florent_Thery_LAD2013-lustre-bull-monitoring.pdf<br />
* Daniel Rodwell and Patrick Fitzhenry, "Fine-Grained File System Monitoring with Lustre Jobstat", LUG2014. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
* Gabriele Paciucci and Andrew Uselton, "Monitoring the Lustre* file system to maintain optimal performance", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/15_Gabriele_Paciucci_LAD13_Monitoring_05.pdf<br />
* Christopher Morrone, "LMT Lustre Monitoring Tools", LUG2011. http://cdn.opensfs.org/wp-content/uploads/2012/12/400-430_Chris_Morrone_LMT_v2.pdf<br />
<br />
* https://github.com/jhammond/lltop<br />
* https://github.com/chaos/lmt<br />
* https://github.com/chaos/cerebro<br />
* http://graphite.readthedocs.org/en/latest/<br />
* https://mathias-kettner.de/check_mk<br />
* https://github.com/shawn-sterling/graphios<br />
<br />
[[Category: Monitoring|!]]<br />
[[Category: Statistics]]<br />
[[Category: Metrics]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=File:Timestamp_graphite_jitter.PNG&diff=539File:Timestamp graphite jitter.PNG2015-04-29T15:03:14Z<p>Sknolin: file showing jitter between check_mk vs perl w/timestamp to graphite.</p>
<hr />
<div>file showing jitter between check_mk vs perl w/timestamp to graphite.</div>Sknolinhttp://wiki.lustre.org/index.php?title=File:Cmk-perl.PNG&diff=538File:Cmk-perl.PNG2015-04-29T15:01:11Z<p>Sknolin: Sknolin moved page File:Cmk-perl.PNG to File:Trash.PNG: bad file name / file doesn't exist</p>
<hr />
<div>#REDIRECT [[File:Trash.PNG]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=File:Trash.PNG&diff=537File:Trash.PNG2015-04-29T15:01:11Z<p>Sknolin: Sknolin moved page File:Cmk-perl.PNG to File:Trash.PNG: bad file name / file doesn't exist</p>
<hr />
<div>I don't see the file upload tool, so I can't upload the file..</div>Sknolinhttp://wiki.lustre.org/index.php?title=File:Meta-oveview.PNG&diff=534File:Meta-oveview.PNG2015-04-29T14:54:09Z<p>Sknolin: grafana metadata overview screen shot</p>
<hr />
<div>grafana metadata overview screen shot</div>Sknolinhttp://wiki.lustre.org/index.php?title=Lustre_Monitoring_and_Statistics_Guide&diff=533Lustre Monitoring and Statistics Guide2015-04-29T14:42:15Z<p>Sknolin: /* Jobstats */</p>
<hr />
<div>== Introduction ==<br />
<br />
This guide is by Scott Nolin (scott.nolin@ssec.wisc.edu), of the University of Wisconsin Space Science and Engineering Center.<br />
<br />
There are a variety of useful statistics and counters available on Lustre servers and clients. This is an attempt to detail some of these statistics and methods for collecting and working with them.<br />
<br />
This does not include Lustre log analysis.<br />
<br />
The presumed audience for this is system administrators attempting to better understand and monitor their Lustre file systems.<br />
<br />
=== Adding to This Guide ===<br />
<br />
If you have improvements, corrections, or more information to share on this topic please contribute to this page. Ideally this would become a community resource.<br />
<br />
== Lustre Versions ==<br />
<br />
This information is based on working primarily with Lustre 2.4 and 2.5.<br />
<br />
== Reading /proc vs lctl ==<br />
<br />
'cat /proc/fs/lustre...' vs 'lctl get_param'<br />
<br />
With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats. This is to ensure portability. I will use this method in all examples; as a bonus, the syntax is often a little shorter. <br />
<br />
== Data Formats ==<br />
The format of the various statistics files varies (and I'm not sure if there is any reason for this). The format names here are entirely *my invention*; they aren't a Lustre standard or anything.<br />
<br />
It is useful to know the various formats of these files so you can parse the data and collect for use in other tools. <br />
<br />
=== Stats ===<br />
<br />
What I consider a "standard" stats file includes, for example, each OST or MDT as a multi-line record, followed by the data. <br />
<br />
Example:<br />
<pre><br />
obdfilter.scratch-OST0001.stats=<br />
snapshot_time 1409777887.590578 secs.usecs<br />
read_bytes 27846475 samples [bytes] 4096 1048576 14421705314304<br />
write_bytes 16230483 samples [bytes] 1 1048576 14761109479164<br />
get_info 3735777 samples [reqs]<br />
</pre><br />
<br />
snapshot_time = when the stats were written<br />
<p><br />
For read_bytes and write_bytes:<br />
* First number = number of times (samples) the OST has handled a read or write. <br />
* Second number = the minimum read/write size<br />
* Third number = maximum read/write size<br />
* Fourth = sum of all the read/write requests in bytes, the quantity of data read/written.<br />
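<br />
A minimal Python sketch of parsing this format: each data line is the statistic name, the sample count, the word "samples", the unit in brackets, and then (optionally) the min/max/sum fields described above. The dictionary keys are my own naming, not anything Lustre defines:<br />
<pre><br />
# Parse "stats"-format output from 'lctl get_param', e.g. obdfilter.*.stats.<br />
# Returns {target: {stat_name: {"samples": ..., "unit": ..., "min": ..., "max": ..., "sum": ...}}}<br />
def parse_stats(text):<br />
    results = {}<br />
    target = None<br />
    for line in text.splitlines():<br />
        if "=" in line:<br />
            target = line.split("=", 1)[0]   # e.g. obdfilter.scratch-OST0001.stats<br />
            results[target] = {}<br />
            continue<br />
        fields = line.split()<br />
        # skips snapshot_time and blank lines; min/max/sum are optional,<br />
        # and some counters also append a sum-of-squares field<br />
        if target is not None and len(fields) >= 4 and fields[2] == "samples":<br />
            entry = {"samples": int(fields[1]), "unit": fields[3].strip("[]")}<br />
            if len(fields) >= 7:<br />
                entry["min"], entry["max"], entry["sum"] = map(int, fields[4:7])<br />
            results[target][fields[0]] = entry<br />
    return results<br />
</pre><br />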
<br />
=== Jobstats ===<br />
<br />
Jobstats are slightly more complex multi-line records. They look like JSON, except for the (-) list markers introducing each job (the output is essentially YAML). Each OST or MDT has an entry for each jobid (or procname_uid, depending on how jobstats is configured), and then the data. <br />
<br />
Example: <br />
<pre><br />
obdfilter.scratch-OST0000.job_stats=job_stats:<br />
- job_id: 56744<br />
snapshot_time: 1409778251<br />
read: { samples: 18722, unit: bytes, min: 4096, max: 1048576, sum: 17105657856 }<br />
write: { samples: 478, unit: bytes, min: 1238, max: 1048576, sum: 412545938 }<br />
setattr: { samples: 0, unit: reqs }<br />
punch: { samples: 95, unit: reqs }<br />
- job_id: . . . ETC<br />
</pre><br />
<br />
Notice this is very similar to 'stats' above.<br />
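<br />
A sketch of pulling these records apart with regular expressions in Python (the output is close enough to YAML that a YAML parser is another option). The field names follow the example above; if you parse output covering several OSTs at once you would also want to track the target line:<br />
<pre><br />
import re<br />
<br />
# Parse job_stats output into {job_id: {operation: {"samples": ..., "sum": ...}}}.<br />
JOB_RE = re.compile(r"^- +job_id:\s*(\S+)", re.M)<br />
OP_RE = re.compile(r"(\w+):\s*\{\s*samples:\s*(\d+),\s*unit:\s*(\w+)"<br />
                   r"(?:,\s*min:\s*(\d+),\s*max:\s*(\d+),\s*sum:\s*(\d+))?")<br />
<br />
def parse_job_stats(text):<br />
    jobs = {}<br />
    chunks = JOB_RE.split(text)   # [preamble, job_id1, body1, job_id2, body2, ...]<br />
    for job_id, body in zip(chunks[1::2], chunks[2::2]):<br />
        ops = {}<br />
        for name, samples, unit, mn, mx, total in OP_RE.findall(body):<br />
            ops[name] = {"samples": int(samples), "unit": unit}<br />
            if total:   # the min/max/sum group is optional<br />
                ops[name].update(min=int(mn), max=int(mx), sum=int(total))<br />
        jobs[job_id] = ops<br />
    return jobs<br />
</pre><br />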
<br />
=== Single ===<br />
<br />
These really boil down to just a single number in a file. But if you use "lctl get_param" you get an output that is nice for parsing. For example: <br />
<pre>[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail<br />
<br />
<br />
osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384<br />
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540<br />
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532<br />
</pre><br />
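<br />
Because the output is one name=value pair per line, parsing is trivial. A Python example:<br />
<pre><br />
# Parse 'single'-format output such as osd-ldiskfs.*OST*.kbytesavail<br />
# into {parameter_name: integer_value}.<br />
def parse_single(text):<br />
    values = {}<br />
    for line in text.splitlines():<br />
        if "=" in line:<br />
            name, _, value = line.partition("=")<br />
            values[name.strip()] = int(value)<br />
    return values<br />
<br />
# e.g. {"osd-ldiskfs.scratch-OST0000.kbytesavail": 10563714384, ...}<br />
</pre><br />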
<br />
=== Histogram ===<br />
<br />
Some stats are histograms; these types aren't covered here. They are typically useful as-is, without further parsing (as far as I know).<br />
<br />
<br />
* brw_stats<br />
* extent_stats<br />
<br />
<br />
<br />
== Interesting Statistics Files ==<br />
<br />
This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. For example, you will notice these are mostly server stats; there is a wealth of client stats, too, not detailed here. Additions or corrections are welcome.<br />
<br />
* Host Type = MDS, OSS, client<br />
* Target = the parameter name to pass to "lctl get_param"<br />
* Format = data format discussed above<br />
<br />
{| class="wikitable"<br />
|-<br />
!Host Type !! Target !! Format !! Discussion<br />
|-<br />
| MDS || mdt.*MDT*.num_exports || single || number of exports per MDT - these are clients, including other lustre servers<br />
|-<br />
| MDS || mdt.*.job_stats || jobstats || Metadata jobstats. Note that with Lustre DNE you may have more than one MDT; even if you don't today, it may be wise to design any tools with that assumption.<br />
|-<br />
| OSS || obdfilter.*.job_stats || jobstats || the per OST jobstats. <br />
|-<br />
| MDS || mdt.*.md_stats || stats || Overall metadata stats per MDT<br />
|-<br />
| MDS || mdt.*MDT*.exports.*@*.stats || stats || Per-export metadata stats. The exports subdirectory lists client connections by NID. The exports are named by interfaces, which can be unwieldy. See "lltop" for an example of a script that used this data well. The sum of all the export stats should provide the same data as md_stats, but it is still very convenient to have md_stats; "ltop" uses them, for example.<br />
|-<br />
| OSS || obdfilter.*.stats || stats || Operations per OST. Read and write data is particularly interesting<br />
|-<br />
| OSS || obdfilter.*OST*.exports.*@*.stats || stats || per-export OSS statistics<br />
|-<br />
| MDS || osd-*.*MDT*.filesfree or filestotal || single || available or total inodes<br />
|-<br />
| MDS || osd-*.*MDT*.kbytesfree or kbytestotal || single || available or total disk space<br />
|-<br />
| OSS || obdfilter.*OST*.kbytesfree or kbytestotal, filesfree, filestotal || single || inodes and disk space as in MDS version<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.stats || stats (meaning of some fields unclear) || Lustre distributed lock manager (ldlm) pool stats. I do not fully understand these stats or the format. It also appears that the same values are duplicated as individual single-value files in the pool directory. My understanding of these stats comes from http://wiki.lustre.org/doxygen/HEAD/api/html/ldlm__pool_8c_source.html<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.lock_count || single || number of locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.granted || single || lustre distributed lock manager (ldlm) granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_plan || single || ldlm lock planned number of granted locks<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_rate || single || ldlm lock grant rate aka 'GR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.cancel_rate || single || ldlm lock cancel rate aka 'CR'<br />
|-<br />
| OSS || ldlm.namespaces.filter-*.pool.grant_speed || single || ldlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR'. Or you can just get 'CR' from the stats file I assume.<br />
|}<br />
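<br />
As a small example of combining a couple of the 'single' parameters from this table, the sketch below reports how full each OST is using the obdfilter kbytesfree and kbytestotal values (run on the OSS; the parameter names are from the table, everything else is illustrative):<br />
<pre><br />
import subprocess<br />
<br />
def get_param(pattern):<br />
    """Run 'lctl get_param' and return {parameter: integer value}."""<br />
    out = subprocess.check_output(["lctl", "get_param", pattern]).decode()<br />
    values = {}<br />
    for line in out.splitlines():<br />
        if "=" in line:<br />
            name, _, value = line.partition("=")<br />
            values[name.strip()] = int(value)<br />
    return values<br />
<br />
# Percentage used per OST, assuming both queries return the same targets.<br />
free = get_param("obdfilter.*OST*.kbytesfree")<br />
total = get_param("obdfilter.*OST*.kbytestotal")<br />
for name, kb_free in sorted(free.items()):<br />
    target = name.split(".")[1]   # e.g. scratch-OST0000<br />
    kb_total = total[name.replace("kbytesfree", "kbytestotal")]<br />
    print("%s %.1f%% used" % (target, 100.0 * (kb_total - kb_free) / kb_total))<br />
</pre><br />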
<br />
== Working With the Data ==<br />
<br />
Packages, tools, and techniques for working with Lustre statistics.<br />
<br />
=== Open Source Monitoring Packages ===<br />
<br />
*LMT - provides 'top' style monitoring of server nodes, and historical data via mysql. https://github.com/chaos/lmt<br />
*lltop and xltop - monitoring with batch scheduler integration. Newer Lustre versions with jobstats likely provide similar data very conveniently, but these are still very good for examples of working with monitoring data. https://github.com/jhammond/lltop https://github.com/jhammond/xltop<br />
<br />
<br />
=== Build it Yourself ===<br />
<br />
Here are basic steps and techniques for working with the Lustre statistics. <br />
<br />
# '''Gather''' the data on hosts you are monitoring. Deal with the syntax, extract what you want<br />
# '''Collect''' the data centrally - either pull or push it to your server, or collection of monitoring servers.<br />
# '''Process''' the data - this may be optional or minimal.<br />
# '''Alert''' on the data - optional but often useful.<br />
# '''Present''' the data - allow for visualization, analysis, etc.<br />
<br />
Some recent tools for working with metrics and time series data have made some of the more difficult parts of this task relatively easy, especially graphical presentation.<br />
<br />
Here are details of some solutions tested or in use:<br />
<br />
==== Collectl and Ganglia ====<br />
<br />
Collectl supports Lustre stats. Note that there have recently been some changes: Lustre support in collectl is moving to plugins. See http://sourceforge.net/p/collectl/mailman/message/31992463 and https://github.com/pcpiela/collectl-lustre<br />
<br />
This process is not based on the new versions, but they should work similarly. <br />
<br />
# collectl - does the '''gather''' by writing to a text file on the host being monitored<br />
# Ganglia does the '''collect''' via gmond and the python script 'collectl.py', and the '''present''' via the Ganglia web pages - there is no alerting.<br />
<br />
See https://wiki.rocksclusters.org/wiki/index.php/Roy_Dragseth#Integrating_collectl_and_ganglia<br />
<br />
<br />
==== Perl and Graphite ====<br />
<br />
<br />
Graphite is a very convenient tool for storing, working with, and rendering graphs of time-series data. At SSEC we did a quick prototype for collecting and sending MDS and OSS data using perl. The choice of perl is not particularly important; python or the tool of your choice is fine.<br />
<br />
Software Used:<br />
* Graphite and Carbon - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* lustrestats scripts - these are run every minute via cron on the servers you monitor. For the SSEC prototype we simply sent plain text data via a TCP socket (a rough sketch follows this list). The check_mk scripts in the next section have replaced these original test scripts.<br />
* Grafana - http://grafana.org - this is a dashboard and graph editor for graphite. It is not required, as graphite can be used directly, but is very convenient. It allows not just easy creation of dashboards, but also encourages rapid interactive analysis of the data. Note that elasticsearch can be used to store dashboards for grafana, but is not required.<br />
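<br />
The prototype itself was perl, but the idea fits in a few lines of any language. A rough Python equivalent of the gather-and-send step, run from cron on each server - the Graphite host, metric prefix, and choice of parameter are placeholders, and only the sample counts are sent here:<br />
<pre><br />
#!/usr/bin/env python<br />
# Read one stats file with lctl, send each counter to Carbon over a TCP<br />
# socket, using the snapshot_time from the stats output as the timestamp.<br />
import socket<br />
import subprocess<br />
<br />
GRAPHITE = ("graphite.example.com", 2003)   # placeholder Carbon host and plaintext port<br />
PREFIX = "lustre.mds01"                     # placeholder metric prefix<br />
<br />
def gather(param):<br />
    """Yield (name, sample_count, snapshot_time) from a stats-format parameter."""<br />
    out = subprocess.check_output(["lctl", "get_param", param]).decode()<br />
    snapshot = 0<br />
    for line in out.splitlines():<br />
        fields = line.split()<br />
        if fields and fields[0] == "snapshot_time":<br />
            snapshot = int(float(fields[1]))<br />
        elif len(fields) >= 4 and fields[2] == "samples":<br />
            yield fields[0], int(fields[1]), snapshot<br />
<br />
def send(metrics):<br />
    lines = "".join("%s.%s %d %d\n" % (PREFIX, name, value, ts)<br />
                    for name, value, ts in metrics)<br />
    sock = socket.create_connection(GRAPHITE, timeout=5)<br />
    try:<br />
        sock.sendall(lines.encode("ascii"))<br />
    finally:<br />
        sock.close()<br />
<br />
if __name__ == "__main__":<br />
    send(gather("mdt.*.md_stats"))   # run every minute from cron<br />
</pre><br />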
<br />
==== check_mk and Graphite ====<br />
<br />
Another option is, instead of sending directly with perl, to use a check_mk local agent check.<br />
<br />
The local agent and pnp4nagios mean a reasonable infrastructure is already in place for alerting and also collecting performance data.<br />
<br />
Collecting via perl allowed us to send the timestamp from the Lustre stats (when one exists) directly to Carbon, Graphite's data collection daemon. With the check_mk method this timestamp is lost, so timestamps are instead based on when the local agent check runs. This introduces some inaccuracy - a delay of up to your sample interval. <br />
<br />
Collecting via both methods allows you to see this difference. This graph shows all the "export" stats summed for each method, with derivative applied to create a rate of change. "CMK" is the check_mk data and "timestamped" was from the perl script. Plotting the raw counter data of course shows very little, but with this derived data you can see the difference.<br />
<br />
This data was sampled once per minute: <br />
<br />
[[File:Cmk-perl.PNG|400px]]<br />
<br />
For our uses at SSEC, this was acceptable. Sampling much more frequently will of course make the error smaller.<br />
<br />
<br />
* Graphite - http://graphite.readthedocs.org/en/latest/<br />
* Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts<br />
* OMD (Open Monitoring Distribution) - bundles check_mk, nagios, and pnp4nagios<br />
* check_mk local scripts - these are called via check_mk, at whatever rate is desired. http://www.ssec.wisc.edu/~scottn/files/lustre_stats_mds.cmk http://www.ssec.wisc.edu/~scottn/files/lustre_stats_oss.cmk<br />
* graphios https://github.com/shawn-sterling/graphios - a python script to send your nagios performance data to graphite<br />
* Grafana - http://grafana.org - not required, but convenient for dashboards.<br />
<br />
'''Grafana Lustre Dashboard Screenshots:'''<br />
<br />
[[File:Meta-oveview.PNG|200px|Metadata for multiple file systems.]] [[File:Fs-dashboard.png|200px|Dashboard for a lustre file system.]]<br />
<br />
==== Logstash, python, and Graphite ====<br />
<br />
Brock Palen discusses this method: http://www.failureasaservice.com/2014/10/lustre-stats-with-graphite-and-logstash.html<br />
<br />
==== Collectd plugin and Graphite ====<br />
<br />
This talk mentions a custom collectd plugin to send stats to graphite:<br />
http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
<br />
It is unclear whether the source for that plugin is publicly available.<br />
<br />
==== A Note about Jobstats ====<br />
<br />
If using a Whisper or RRD-file based solution, jobstats may not be a great fit. The strength of RRD or Whisper files is that each metric collected has a fixed, known size on disk. If your metrics are per-job rather than only per-export or per-server, the ''number of metrics'' grows without bound as new jobs appear.<br />
<br />
Solutions anyone?<br />
<br />
<br />
== References and Links ==<br />
<br />
<br />
* Daniel Kobras, "Lustre - Finding the Lustre Filesystem Bottleneck", LAD2012. http://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf<br />
* Florent Thery, "Centralized Lustre Monitoring on Bull Platforms", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/11_Florent_Thery_LAD2013-lustre-bull-monitoring.pdf<br />
* Daniel Rodwell and Patrick Fitzhenry, "Fine-Grained File System Monitoring with Lustre Jobstat", LUG2014. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf <br />
* Gabriele Paciucci and Andrew Uselton, "Monitoring the Lustre* file system to maintain optimal performance", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/15_Gabriele_Paciucci_LAD13_Monitoring_05.pdf<br />
* Christopher Morrone, "LMT Lustre Monitoring Tools", LUG2011. http://cdn.opensfs.org/wp-content/uploads/2012/12/400-430_Chris_Morrone_LMT_v2.pdf<br />
<br />
* https://github.com/jhammond/lltop<br />
* https://github.com/chaos/lmt<br />
* https://github.com/chaos/cerebro<br />
* http://graphite.readthedocs.org/en/latest/<br />
* https://mathias-kettner.de/check_mk<br />
* https://github.com/shawn-sterling/graphios<br />
<br />
[[Category: Monitoring|!]]<br />
[[Category: Statistics]]<br />
[[Category: Metrics]]</div>Sknolinhttp://wiki.lustre.org/index.php?title=Talk:Lustre_with_ZFS_Install&diff=519Talk:Lustre with ZFS Install2015-04-29T03:29:30Z<p>Sknolin: Created page with "It would be better if this could be made more general - remove our specific puppet stuff and replace those packages with a brief description of what they do. ~~~~"</p>
<hr />
<div>It would be better if this could be made more general - remove our specific puppet stuff and replace those packages with a brief description of what they do. [[User:Sknolin|Sknolin]] ([[User talk:Sknolin|talk]]) 20:29, 28 April 2015 (PDT)</div>Sknolinhttp://wiki.lustre.org/index.php?title=Lustre_with_ZFS_Install&diff=518Lustre with ZFS Install2015-04-29T03:28:38Z<p>Sknolin: </p>
<hr />
<div>==Introduction==<br />
<br />
This page is an attempt to provide some information on how to install Lustre with a ZFS backend. You are encouraged to add your own version, either as a separate section or by editing this page into a general guide.<br />
<br />
===Helpful links===<br />
<br />
* http://zfsonlinux.org/lustre-configure-single.html<br />
* http://www.ufb.rug.nl/ger/docs/lustre-zfs.txt<br />
* https://github.com/chaos/lustre/commit/04a38ba7 - ZFS and HA<br />
<br />
=== SSEC Example ===<br />
<br />
This version applies to systems with JBODs where ZFS manages the disks directly, without a Dell RAID controller in between. This guide is very specific to a single installation at UW SSEC: versions have changed, and we use puppet to provide various software packages and configurations. However, it is included here because some of the information may be useful to others.<br />
<br />
# Lustre Server Prep Work<br />
## OS Installation (RHEL6)<br />
### You must use the RHEL/Centos 6.4 Kernel 2.6.32-358<br />
### Use the "lustre" kickstart option which installs a 6.4 kernel<br />
### Define the host in puppet so that it is not a default host - NOTE: We Use Puppet at SSEC to distribute various required packages, other environments will vary!<br />
##Lustre 2.4 installation <br />
### Puppet Modules Needed<br />
* zfs-repo<br />
* lustre-healthcheck<br />
* ib-mellanox<br />
* check_mk_agent-ssec<br />
* puppetConfigFile<br />
* lustre-shutdown<br />
* nagios_plugins<br />
* lustre24-server-zfs<br />
* selinux-disable<br />
<br />
#Configure Metadata Controller<br />
##Map metadata drives to enclosures (with scripts to help)<br />
##For our example MDS system we made aliases 'ssd0', 'ssd1', 'ssd2', and 'ssd3'<br />
###put these in /etc/zfs/vdev_id.conf - for example:<br />
###alias arch03e07s6 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50056b69199-lun-0 <br />
##run udevadm trigger to load drive aliases<br />
##On metadata controller, run mkfs.lustre to create metadata partition. On our example system:<br />
###Use separate MGS for multiple filesystems on same metadata server.<br />
###Separate MGS: mkfs.lustre --mgs --backfstype=zfs lustre-meta/mgs mirror d2 d3 mirror d4 d5<br />
###Separate MDT: mkfs.lustre --fsname=arcdata1 --mdt --mgsnode=172.16.23.14@o2ib --backfstype=zfs lustre-meta/arcdata1-meta<br />
###Create /etc/ldev.conf and add the metadata partition. On example system, we added:<br />
####geoarc-2-15 - MGS zfs:lustre-meta/mgs<br />
####geoarc-2-15 - arcdata-MDT0000 zfs:lustre-meta/arcdata-meta<br />
###Create /etc/modprobe.d/lustre.conf<br />
####options lnet networks="o2ib" routes="tcp metadataip@o2ib0 172.16.24.[220-229]@o2ib0"<br />
####NOTE: if you do not want routing, or if you are having trouble with setup, the simple options lnet networks="o2ib" is fine<br />
##Start Lustre. If you have multiple metadata mounts, you can just run service lustre start.<br />
##Add the lnet service to chkconfig and ensure it starts on boot. We may want to leave lustre off at boot for metadata controllers.<br />
<br />
#Configure OSTs<br />
##Map drives to enclosures (with scripts to help!)<br />
##Run udevadm trigger to load drive aliases.<br />
##mkfs.lustre on MD1200s. <br />
###Example RAIDZ2 on one MD1200: mkfs.lustre --fsname=cove --ost --backfstype=zfs --index=0 --mgsnode=172.16.24.12@o2ib lustre-ost0/ost0 raidz2 e17s0 e17s1 e17s2 e17s3 e17s4 e17s5 e17s6 e17s7 e17s8 e17s9 e17s10 e17s11<br />
###Example RAIDZ2 with 2 disks from each enclosure, 5 enclosures (our cove test example): mkfs.lustre --fsname=cove --ost --backfstype=zfs --index=0 --mgsnode=172.16.24.12@o2ib lustre-ost0/ost0 raidz2 e13s0 e13s1 e15s0 e15s1 e17s0 e17s1 e19s0 e19s1 e21s0 e21s1<br />
##Repeat as necessary for additional enclosures.<br />
##Create /etc/ldev.conf<br />
###Example on lustre2-8-11:<br />
###lustre2-8-11 - cove-OST0000 zfs:lustre-ost0/ost0<br />
###lustre2-8-11 - cove-OST0001 zfs:lustre-ost1/ost1<br />
###lustre2-8-11 - cove-OST0002 zfs:lustre-ost2/ost2<br />
##Start OSTs. Example: service lustre start. Repeat as necessary for additional enclosures.<br />
##Add the services to chkconfig so they start on boot.<br />
#Configure backup metadata controller (future)<br />
##Mount the Lustre file system on clients<br />
##Add entry to /etc/fstab. With our example system, our fstab entry is:<br />
###172.16.24.12@o2ib:/cove /cove lustre defaults,_netdev,user_xattr 0 0<br />
##Create empty folder for mountpoint, and mount file system (e.g., mkdir /cove; mount /cove).<br />
<br />
[[Category:HowTo]][[Category:ZFS]][[Category:NeedsContributions]]</div>Sknolin