ZFS JBOD Monitoring

From Lustre Wiki
Revision as of 15:13, 7 August 2015

If you use ZFS software RAID (RAIDZ2, for example) to provide Lustre OSTs, monitoring disk and enclosure health can be a challenge, because vendor disk-array monitoring is typically bundled with RAID controllers rather than available separately.

If you are aware of any vendor-supported monitoring solutions for this or have your own solution, please add to this page.

UW SSEC Solution

This information applies to Linux systems monitoring Dell MD1200 disk arrays directly attached via SAS, with no RAID card.

Disk Failure: zpool status

To detect disk failure, simply check the zpool status; various scripts exist to do this for Nagios/check_mk.
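As a minimal sketch of such a check (not one of the scripts the page refers to; the `ZFS_Pools` service name is illustrative): `zpool status -x` prints `all pools are healthy` when nothing is wrong, so a trivial wrapper can turn its output into a check_mk local-check line.

```shell
#!/bin/bash
# Sketch of a check_mk local check built on `zpool status -x`.
# `zpool status -x` prints "all pools are healthy" when all pools are fine,
# and details of the troubled pool(s) otherwise.

check_pool_health() {
    local output="$1"
    if [ "${output}" = "all pools are healthy" ]; then
        # check_mk local-check format: <status> <service> - <detail>
        echo "0 ZFS_Pools - ${output}"
    else
        echo "2 ZFS_Pools - ${output}"
    fi
}

# Real usage would be:
#   check_pool_health "$(zpool status -x)"
```

Checking `zpool status -x` rather than parsing full `zpool status` output keeps the logic simple, at the cost of lumping all pool problems into one critical state.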

Predictive Failure: smartctl

To monitor predictive drive failure, we use 'smartctl', provided by the 'smartmontools' package on CentOS.

Example check_mk script:

#!/bin/bash
#
# check_mk local check: report SMART health for every whole-disk entry
# under /dev/disk/by-vdev (partition entries are filtered out).

DISKS="$(/bin/ls /dev/disk/by-vdev | /bin/grep -v part)"

for DISK in ${DISKS}; do
    HEALTH="$(smartctl -H "/dev/disk/by-vdev/${DISK}" | grep SMART)"
    # SAS drives report "SMART Health Status: OK"; the status is field 4.
    HEALTHSTATUS="$(echo "${HEALTH}" | cut -d ' ' -f 4)"
    if [[ "${HEALTHSTATUS}" != "OK" ]]; then
        status=2
    else
        status=0
    fi
    echo "${status} SMART_Status_${DISK} - ${DISK} ${HEALTH}"
done

Enclosure Monitoring

While the above techniques tell you if you have a disk problem, you still need to monitor the status of the arrays themselves; in our case, MD1200 disk arrays attached via SAS. So far, the best answer for us is sg_ses from the sg3_utils package, which reads SCSI Enclosure Services (SES) data.

To monitor our enclosures we use this script: http://www.ssec.wisc.edu/~scottn/Lustre_ZFS_notes/script/check_md1200.pl
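As a rough sketch of the idea behind such a check (this is not the wiki's check_md1200.pl script, and the exact sg_ses output text varies by enclosure firmware): `sg_ses --page=2 /dev/sgN` prints the SES Enclosure Status page, with one `status:` field per element (disk slot, fan, power supply, and so on), which a script can scan for non-OK states.

```shell
#!/bin/bash
# Sketch, assuming sg3_utils is installed and the enclosure shows up as a
# /dev/sg* device (locate it with `lsscsi -g` or `sg_map -x`).

# Hypothetical helper: count element lines in sg_ses Enclosure Status
# output whose status indicates a fault.
count_ses_problems() {
    echo "$1" | grep -c 'status: *Critical\|status: *Unrecoverable'
}

# Real usage would be something like:
#   STATUS_PAGE="$(sg_ses --page=2 /dev/sg3)"
#   if [ "$(count_ses_problems "${STATUS_PAGE}")" -gt 0 ]; then
#       echo "enclosure fault detected"
#   fi
```

The status strings matched here (`Critical`, `Unrecoverable`) are assumptions drawn from typical sg_ses output; a production check should be tuned against the actual output of the enclosure in use.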