ZFS JBOD Monitoring

If using ZFS software RAID (RAIDZ2 for example) to provide Lustre OSTs, monitoring disk and enclosure health can be a challenge. This is because vendor disk array monitoring is typically included as part of a package with RAID controllers.
If you are aware of any vendor-supported monitoring solutions for this or have your own solution, please add to this page.
== Disk Failure: zpool status ==
To detect disk failure, simply check the output of zpool status.
This is useful for any ZFS filesystem, even those built on traditional RAID.
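For a quick interactive check, zpool can report overall health directly; the pool name below is only an example:
<pre>
# Print only pools that have problems; on a healthy system this is a single line.
$ zpool status -x
all pools are healthy

# Or query the health property of a specific pool.
$ zpool list -H -o health lustre-ost0
ONLINE
</pre>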
=== Standalone Example ===
For standalone checks, a script can be written around the zpool status command. The example below uses the /etc/ldev.conf file to find the pool names associated with Lustre, and mutt to send the alert email:
<pre>
#!/bin/bash
#
# zfs monitoring script for lustre with zfs backend
# uses /etc/ldev.conf to locate zpools, then zpool status to find degraded pools.
HELP="
This script uses /etc/ldev.conf and zpool status to identify mounted pools, then sends an email if
a pool returns a status other than ONLINE
"
LDEV_FILE="/etc/ldev.conf"
EMAIL="[email protected]"

# mail the full zpool status output to the admin address
send_email ()
{
/usr/bin/mutt -s "zpool status warning on $HOSTNAME" $EMAIL << EOF
$1
EOF
}

if [ ! -f $LDEV_FILE ]
then
    /usr/bin/mutt -s "WARNING, no ldev file found on $HOSTNAME" $EMAIL
    exit
fi

for POOL in `cat $LDEV_FILE`
do
    # only look at the zfs:pool/dataset fields that belong to OSTs
    if [[ `echo $POOL | grep ost` ]]
    then
        POOL_NAME=`echo $POOL | cut -f2 -d":" | cut -f1 -d"/"`
        POOL_STATUS=`/sbin/zpool status $POOL_NAME`
        # check for errors running zpool
        if [ $? -ne 0 ]
        then
            send_email "$POOL_STATUS"
        fi
        # get pool state and alert on anything other than ONLINE
        POOL_STATE=`echo "$POOL_STATUS" | grep state`
        if [[ $POOL_STATE != *ONLINE* ]]
        then
            send_email "$POOL_STATUS"
        fi
    fi
done
</pre>
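For reference, ldev.conf entries typically look like the following (host, label, and pool names are only examples). The grep and cut commands above pull the zpool name out of the zfs:pool/dataset field, so the first entry below yields the pool name lustre-ost0:
<pre>
# /etc/ldev.conf
# local  foreign  label            device
oss1     -        lustre-OST0000   zfs:lustre-ost0/ost0
oss1     -        lustre-OST0001   zfs:lustre-ost1/ost1
</pre>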
The script above can then be run as a cron job; an example crontab entry follows.
<pre>
#zpool monitoring crontab for oss systems
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
HOME=/tmp
50 06 * * * root /usr/local/zfs_monitor/zfs_mon.sh > /dev/null 2>&1; exit 0
</pre>
=== Check_mk Example ===
This is an example check_mk local check; scripts for Nagios or other agent-based monitoring systems will be similar. It prints a single line in check_mk's local check format: a status code, the service name, a performance data placeholder ("-"), and the status text.
<pre>
#!/bin/bash
currentDate=$(date +"%y%m%d")
zfsVols=$(/sbin/zpool list -H -o name)
# nothing to report on hosts without pools
if [ "$zfsVols" == "" ]; then
    exit
fi
status=0
statustxt=""
for volume in ${zfsVols}
do
    # "none requested" means the pool has never been scrubbed
    if [ $(/sbin/zpool status $volume | egrep -c "none requested") -ge 1 ]; then
        status=1
        statustxt="$volume needs an initial scrub; $statustxt"
    fi
    if [ $(/sbin/zpool status $volume | egrep -c "scrub in progress|resilver") -ge 1 ]; then
        status=0
        statustxt="$volume scrub in progress; $statustxt"
    fi
    # date the last scrub completed, taken from the scrub line of zpool status
    scrubReportedDate=$(/sbin/zpool status $volume | grep scrub | cut -d' ' -f13-)
    scrubDate=$(date -d "$scrubReportedDate + 35 days" +"%y%m%d")
    # critical if the last scrub finished more than 35 days ago
    if [ $currentDate -ge $scrubDate ]; then
        status=2
        statustxt="$volume scrub out of date; $statustxt"
    fi
    if [ $scrubDate -ge $currentDate ]; then
        if [[ $status != 1 && $status != 2 ]]; then
            status=0
            statustxt="$volume up to date; $statustxt"
        fi
    fi
done
echo "$status ZPOOL_SCRUB_STATUS - $statustxt"
</pre>
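The scrub-age check above only makes sense if scrubs are actually being scheduled. A minimal sketch of a cron.d entry for that, in the same style as the monitoring crontab above (pool names and schedule are only examples):
<pre>
# weekly zpool scrubs, staggered so the pools are not all scrubbed at once
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
0 02 * * 6 root /sbin/zpool scrub lustre-ost0
0 02 * * 0 root /sbin/zpool scrub lustre-ost1
</pre>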
== Predictive Failure: smartctl ==
To monitor for predicted drive failure, you can use smartctl, provided by the smartmontools package on CentOS.
Example check_mk script:
<pre>
#!/bin/bash
#
DISKS="$(/bin/ls /dev/disk/by-vdev | /bin/grep -v part)"
UNHEALTHY_COUNT=0
for DISK in ${DISKS}
do
    # grab the SMART health line for each drive
    HEALTH=`smartctl -H /dev/disk/by-vdev/${DISK} | grep SMART`
    HEALTHSTATUS=`echo ${HEALTH} | cut -d ' ' -f 4`
    if [[ $HEALTHSTATUS != "OK" ]]; then
        status=2
    else
        status=0
    fi
    echo "$status SMART_Status_${DISK} - ${DISK} ${HEALTH}"
done
</pre>
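For reference, the HEALTH line the script parses comes straight from smartctl. On a SAS drive it typically reads "SMART Health Status: OK", while ATA drives report "SMART overall-health self-assessment test result: PASSED" instead, so adjust the parsing if your JBOD mixes drive types (device name below is only an example):
<pre>
$ smartctl -H /dev/disk/by-vdev/jbod0-slot0 | grep SMART
SMART Health Status: OK
</pre>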
== Enclosure Monitoring ==
This information applies to Linux systems monitoring Dell MD1200 disk arrays directly attached via SAS, with no RAID card.
While the above techniques tell you if you have a disk problem, you still need to monitor the status of the arrays themselves. For our particular setup this means MD1200 disk arrays attached via SAS, and sg3_utils with the [https://en.wikipedia.org/wiki/SCSI_Enclosure_Services SCSI Enclosure Services] tool sg_ses is the best answer so far.
At SSEC we monitor our enclosures with this script: http://www.ssec.wisc.edu/~scottn/Lustre_ZFS_notes/script/check_md1200.pl
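A minimal sketch of that approach using lsscsi and sg_ses is below. It assumes the sg3_utils and lsscsi packages are installed, and the element status strings can vary between sg_ses versions, so treat it as a starting point and see the check_md1200.pl script above for a more complete implementation:
<pre>
#!/bin/bash
# Walk every SES enclosure device and flag elements whose status is not OK.
for SG in $(lsscsi -g | awk '/enclosu/ {print $NF}')
do
    # page 2 is the SES enclosure status page
    BAD=$(sg_ses --page=2 $SG | grep -c -E "Critical|Noncritical|Unrecoverable")
    if [ "$BAD" -gt 0 ]; then
        echo "WARNING: enclosure $SG reports $BAD elements in a failed state"
        sg_ses --page=2 $SG | grep -E "Critical|Noncritical|Unrecoverable"
    fi
done
</pre>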
[[Category:Monitoring]][[Category:ZFS]][[Category:Howto]]