Mar 9 2010

More RAID tidbits – Monitoring all RAID events and changing the default email template

A geek really knows the importance of data, and of the backups that save you from pulling your hair out! When one of the hard drives on a server of mine died after a well-served life of 6000+ hours, I counted myself lucky that the other component of the RAID1 array came to the rescue. The cause was perhaps a short circuit, which could have cost me the biggest data loss of my life, so a blazing smile was well deserved. Electric power is one of the countless things that don’t work here the way they should (oh, it’s a long story; I should tell some of it sometime later)!

I got an email from mdmonitor telling me about a DegradedArray event. But while I was rebuilding the array, I noticed that I got no alerts about the rebuild process or about array status updates, which I wanted to investigate. Until then, I didn’t even know that ‘mdadm --monitor’ only emails you the critical events. So I pulled up the man page and saw that these are the critical events:

  • DeviceDisappeared
  • Fail
  • FailSpare
  • DegradedArray

The rest of the events are not emailed at all! Also, RHEL5’s mdadm package has a pre-compiled template for the email that mdadm sends when a critical event occurs, which I wanted to change as well because it looks pretty immature:

This is an automatically generated mail message from mdadm running on HOSTNAME
A DegradedArray event had been detected on md device /dev/md1.
Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:
bla bla bla

Seriously, it says “faithfully”… wth? Lol. We know that all machines are faithful to a human unless they’re broken! :D It definitely needed to be changed. Checking /etc/init.d/mdmonitor at least made it clear that the template is not something you can change there. However, mdadm only uses the default template when MAILADDR is specified in /etc/mdadm.conf. When the PROGRAM parameter is used, mdadm runs the given script instead, passing it the event name and the RAID array (plus the affected component device, if any) as arguments.
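
To make the PROGRAM mechanism a bit more concrete, here is a rough sketch of how a handler script could use the arguments mdadm hands over. The helper names raid_subject and raid_severity are made up for illustration, and the "critical" list is simply the four events listed earlier in this post, not something read from mdadm itself.

```shell
#!/bin/bash
# Sketch only: argument handling for a PROGRAM handler.
# mdadm passes: $1 = event name, $2 = md device, $3 = component device (may be absent).

# Hypothetical helper: build a mail subject from the handler arguments.
# The optional 4th argument overrides the host name (handy for testing).
raid_subject() {
    local event="$1" array="$2" component="${3:-}" host="${4:-$HOSTNAME}"
    if [ -n "$component" ]; then
        printf '%s on %s (%s) at %s\n' "$event" "$array" "$component" "$host"
    else
        printf '%s on %s at %s\n' "$event" "$array" "$host"
    fi
}

# Hypothetical helper: label the events this post lists as critical.
raid_severity() {
    case "$1" in
        DeviceDisappeared|Fail|FailSpare|DegradedArray) echo critical ;;
        *) echo info ;;
    esac
}
```

Inside a real handler, one might call raid_subject "$@" to build the subject and use raid_severity "$1" to decide, say, whether to prefix it with a warning.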

So this is what I did:


# mdadm --detail --scan >> /etc/mdadm.conf

# echo "PROGRAM /etc/raidalerter" >> /etc/mdadm.conf
# sed -e '1i\DEVICE partitions' -i  /etc/mdadm.conf
# cat /etc/raidalerter    (create this file with the script below)

#!/bin/bash
# mdadm calls this script with: $1 = event name, $2 = md device, $3 = component device (if any)
echo -e "Likely an unfavourable or a bad thing just happened to your RAID. Even if its recovering, it was a bad thing which caused this! \n\n\n" $(cat -A /proc/mdstat | sed 's/\$/\\n/g') | mail -s "$1 on $2 $3 at $HOSTNAME" some-mail-address@example.com

# chmod +x /etc/raidalerter
# service mdmonitor restart
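
The one-liner above squeezes /proc/mdstat through cat -A and sed so that echo -e can put the line breaks back. A possibly tidier alternative, shown here only as a sketch and not as the script I actually run, is to print the message and the file as they are. The names build_alert_body and MDSTAT_FILE are made up for this example; MDSTAT_FILE exists only so the function can be pointed at a sample file.

```shell
#!/bin/bash
# Sketch of an alternative handler body (not the script used in this post).
# Assumption: MDSTAT_FILE and build_alert_body are illustrative names.
MDSTAT_FILE="${MDSTAT_FILE:-/proc/mdstat}"

build_alert_body() {
    # Intro line followed by one blank line.
    printf '%s\n\n' "Likely an unfavourable or a bad thing just happened to your RAID."
    # Append the array status unchanged, keeping its own line breaks.
    if [ -r "$MDSTAT_FILE" ]; then
        cat "$MDSTAT_FILE"
    else
        printf 'Could not read %s\n' "$MDSTAT_FILE"
    fi
}

# In a real handler the body would be piped to mail, for example:
# build_alert_body | mail -s "$1 on $2 $3 at $HOSTNAME" some-mail-address@example.com
```

Because the status file is passed through untouched, the mail body keeps the same layout you see when you cat /proc/mdstat by hand.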

Provided that you have a working MTA, mails will be delivered for every RAID event, with as much verbosity as possible. I don’t think any hardware RAID does that?!
I then tested it on a small array to make sure that the alerts actually get delivered.


# mdadm /dev/md0 -f /dev/sdb1 -r /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
mdadm: hot removed /dev/sdb1
# mdadm /dev/md0 -a /dev/sdb1
mdadm: re-added /dev/sdb1

Preview:

Subject: RebuildFinished on /dev/md0 at ToughGuy
Likely an unfavourable or a bad thing just happened to your RAID. Even if its recovering, it was a bad thing which caused this!

Personalities : [raid1]
md1 : active raid1 sdb3[1] sda3[0]
      724555520 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
      4008064 blocks [2/2] [UU]
unused devices: <none>
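
Failing a real disk is not the only way to exercise the handler. As far as I know, mdadm's monitor mode also has a --test option that generates a TestMessage alert for each array at startup (check the man page of your version). Another option is to call the handler by hand with made-up arguments and a stand-in for mail. The sketch below does the latter; dry_run_handler is a hypothetical helper, and the stub it creates only understands the mail -s SUBJECT form used in this post.

```shell
#!/bin/bash
# Sketch: dry-run a PROGRAM handler without touching any array or sending mail.
# Assumption: dry_run_handler is an illustrative name, not part of mdadm.
dry_run_handler() {
    local handler="$1"; shift
    local stub_dir
    stub_dir=$(mktemp -d) || return 1

    # Stand-in "mail" that prints the subject and body instead of sending them.
    cat > "$stub_dir/mail" <<'EOF'
#!/bin/bash
[ "$1" = "-s" ] && echo "Subject: $2"
cat
EOF
    chmod +x "$stub_dir/mail"

    # Run the handler with the stub first in PATH, passing the fake event arguments.
    PATH="$stub_dir:$PATH" "$handler" "$@"
    local rc=$?

    rm -rf "$stub_dir"
    return $rc
}
```

For example, dry_run_handler /etc/raidalerter TestMessage /dev/md0 should print the subject and body that would otherwise have been mailed.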

