ZFS JBOD Monitoring
Revision as of 16:19, 12 April 2016

If using ZFS software RAID (RAIDZ2, for example) to provide Lustre OSTs, monitoring disk and enclosure health can be a challenge, because vendor disk-array monitoring is typically bundled with RAID controllers.

If you are aware of any vendor-supported monitoring solutions for this or have your own solution, please add to this page.


Disk Failure: zpool status

To detect disk failure, simply check the zpool status. There are various scripts to do this for nagios/check_mk.
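For reference, a healthy pool reports a "state: ONLINE" line in its zpool status output, and a failed disk flips that to DEGRADED (or FAULTED). The sketch below extracts the state line from sample text rather than a live pool, so it runs anywhere; the pool name and status message are illustrative, not from a real system:

```shell
# Sample `zpool status` output (illustrative text, not from a live system)
SAMPLE='  pool: ostpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing.'

# Pull out the pool state; monitoring scripts key on this one line
STATE=$(echo "$SAMPLE" | grep 'state:' | awk '{print $2}')
echo "$STATE"
```

On a live system, replace the sample text with `$(zpool status <poolname>)`, as the script below does.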

To perform checks outside of Nagios, a script can be written around the zpool status command. The following example uses the ldev.conf file to find the pool names associated with Lustre, and mutt to send the notification email:

#!/bin/bash
#
# ZFS monitoring script for Lustre with a ZFS backend.
# Uses /etc/ldev.conf to locate zpools, then zpool status to find degraded pools.

HELP="
 This script uses /etc/ldev.conf and zpool status to identify mounted pools, then sends an email if
 a pool returns a status other than ONLINE
"

LDEV_FILE="/etc/ldev.conf"
EMAIL="[email protected]"

send_email ()
{
/usr/bin/mutt -s "zpool status warning on $HOSTNAME" $EMAIL << EOF
"$1"
EOF
}

if [ ! -f $LDEV_FILE ]
then
	# Give mutt a body on stdin, or it may prompt interactively under cron
	echo "no ldev file found" | /usr/bin/mutt -s "WARNING, no ldev file found on $HOSTNAME" $EMAIL
	exit 1
fi

for POOL in `cat $LDEV_FILE`
do
	# Only fields naming an OST (e.g. zfs:ostpool/ost0) are of interest
	if [[ `echo $POOL | grep ost` ]]
	then
		POOL_NAME=`echo $POOL | cut -f2 -d":" | cut -f1 -d"/"`
		POOL_STATUS=`/sbin/zpool status $POOL_NAME`

		# Alert if zpool status itself failed
		if [ $? -ne 0 ]
		then
			send_email "$POOL_STATUS"
		fi

		# Alert if the pool state is anything other than ONLINE
		POOL_STATE=`echo "$POOL_STATUS" | grep state`
		if [[ $POOL_STATE != *ONLINE* ]]
		then
			send_email "$POOL_STATUS"
		fi
	fi

done

The script above can then be run as a cron job, for example:

#zpool monitoring crontab for oss systems
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
HOME=/tmp
50 06 * * * root /usr/local/zfs_monitor/zfs_mon.sh > /dev/null 2>&1; exit 0

Predictive Failure: smartctl

To monitor for predictive drive failure, you can use smartctl, provided by the smartmontools package on CentOS.

Example check_mk script:

#!/bin/bash
#
# check_mk local check: report SMART health for every disk known by vdev alias.

DISKS="$(/bin/ls /dev/disk/by-vdev | /bin/grep -v part)"

for DISK in ${DISKS}
do
	# SCSI/SAS drives report a line such as "SMART Health Status: OK"
	HEALTH=`smartctl -H /dev/disk/by-vdev/${DISK} | grep SMART`
	HEALTHSTATUS=`echo ${HEALTH} | cut -d ' ' -f 4`
	if [[ $HEALTHSTATUS != "OK" ]]; then
		status=2	# check_mk CRITICAL
	else
		status=0	# check_mk OK
	fi
	echo "$status SMART_Status_${DISK} - ${DISK} ${HEALTH}"

done
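One caveat about the field extraction in that script: cut -d ' ' -f 4 assumes the SCSI/SAS form of the smartctl -H health line ("SMART Health Status: OK"), which is what SAS drives behind an MD1200 produce. ATA drives instead print "SMART overall-health self-assessment test result: PASSED", where the verdict lands in a different field, so adjust the cut if you have mixed drive types. A quick sketch against sample text (no drive needed):

```shell
# SCSI/SAS form of the smartctl -H health line (sample text, not a live query)
SAS_LINE="SMART Health Status: OK"

# Fields: 1=SMART 2=Health 3=Status: 4=OK
SAS_STATUS=$(echo "$SAS_LINE" | cut -d ' ' -f 4)
echo "$SAS_STATUS"
```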

Enclosure Monitoring

This information applies to Linux systems monitoring Dell MD1200 disk arrays directly attached via SAS, with no RAID card. While the techniques above tell you if you have a disk problem, you still need to monitor the status of the arrays themselves. For these MD1200 enclosures, sg3_utils and SCSI Enclosure Services (sg_ses) are the best answer we have found so far.

At SSEC, we monitor our enclosures with this script: http://www.ssec.wisc.edu/~scottn/Lustre_ZFS_notes/script/check_md1200.pl
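If that script does not fit your setup, a minimal sg_ses check can be built the same way as the zpool and smartctl checks above. The sketch below assumes sg_ses (from sg3_utils) prints a summary line of counters such as "INVOP=0, INFO=0, NON-CRIT=0, CRIT=0, UNRECOV=0" for the enclosure status page; verify the exact output format on your sg3_utils version before relying on it:

```shell
#!/bin/bash
# Flag any SES enclosure whose status page reports non-zero problem counters.

check_ses_summary () {
	# Return non-zero if the summary line shows any non-critical, critical,
	# or unrecoverable conditions (counter format is an assumption; see above)
	echo "$1" | grep -Eq '(NON-CRIT|CRIT|UNRECOV)=[1-9]' && return 1
	return 0
}

if command -v sg_ses >/dev/null 2>&1; then
	# Enumerate enclosure sg devices via sysfs
	for SG in /sys/class/enclosure/*/device/scsi_generic/*; do
		[ -e "$SG" ] || continue
		DEV="/dev/$(basename "$SG")"
		# Page 2 is the SES enclosure status diagnostic page
		SUMMARY=$(sg_ses --page=2 "$DEV" | grep 'CRIT=')
		if ! check_ses_summary "$SUMMARY"; then
			echo "WARNING: enclosure $DEV reports a problem: $SUMMARY"
		fi
	done
fi
```

Run as root from cron, this prints a warning line per troubled enclosure that can be mailed or fed to a check_mk local check in the same way as the scripts above.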