RAID dead drive? raidtools-1.00.3-2.i386

Rich Braun richb at pioneer.ci.net
Sat Nov 26 12:20:56 EST 2005


Jerry noted...
> One of the failings of the technical community (I include Linux, Unix,
> Windows, Mac, etc) is that we do not pay a lot of attention to helping
> our user communities to properly recover from errors.

Which reminded me to point out something... there are two sets of software
tools to manage software RAID: mdadm and raidtools.  Currently I think they
both have the same basic capabilities, but going forward I would expect mdadm
to become much more widespread and to gain new features.

So for my own setup I switched to mdadm instead of raidtools.  As for the
original question, which was how to replace a failed drive element, I'll
address that with another point to anyone else here who owns a software RAID
setup:  you should--right now if you haven't ever done so--test and verify
your installation.  You should be able to do the following sequence:

- pull out an active sync'ed drive, run the system for a while (with no
interruption)
- see an alert come up from your monitoring tool
- reboot the system to make sure it comes up properly
- see that your monitoring tool notifies you of a degraded array at boot
- put the drive back in and re-sync it with the array
- see that you no longer get alerts
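A sketch of that test sequence driven from the command line, assuming a
two-disk RAID1 array at /dev/md0 with members /dev/sda1 and /dev/sdb1 (those
device names are assumptions -- check /proc/mdstat for yours).  Rather than
physically pulling the drive, mdadm can mark a member as failed in software,
which exercises the same degraded-array path:

```shell
# Mark one member failed and remove it from the array
# (simulates the pulled drive).
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

# The array is now degraded; your monitoring tool should alert you.
cat /proc/mdstat

# After rebooting and confirming the degraded-array alert, re-add
# the member and watch it resync.
mdadm --manage /dev/md0 --add /dev/sdb1
cat /proc/mdstat
```

These commands require root and a real array, so run them on a machine where
a resync is acceptable, not on a box that is already limping.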

I really want to emphasize the monitoring tool.  It does *no good* to have a
degraded array running for months at a time, and without monitoring you will
never notice a degraded array.  Mdadm handles this in a very straightforward
way; I don't know whether raidtools has this feature.  SuSE now has this tool built
into their installation script, but you can easily add the command
'/sbin/mdadm -F -d 60 -m username@myhost -s -c /etc/mdadm.conf' to any system.
 Don't run without it.  If you have hardware RAID, you need to figure out how
to set up monitoring.  I threw out a hardware RAID controller recently because
I couldn't figure out how to do monitoring.  A two-drive software RAID1 disk
mirror setup on Linux will deliver 95% of the performance that any hardware
RAID will provide; you really only need hardware RAID if you're running a
larger configuration.
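For reference, a minimal /etc/mdadm.conf that the monitor command above can
scan might look like the following (the device names and mail address are
placeholders for your own setup):

```shell
# /etc/mdadm.conf -- example for a two-disk RAID1 mirror
DEVICE /dev/sda1 /dev/sdb1
ARRAY /dev/md0 devices=/dev/sda1,/dev/sdb1
MAILADDR username@myhost
```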

The recovery steps for replacing a failed drive are:

- Install the drive, reboot
- Create the RAID partition(s) using fdisk, partition-id type 'fd', same
  size as your existing drive's RAID partition(s)
- Issue raidhotadd (if raidtools) or mdadm --manage --add to start the sync
- You can watch the sync in progress via 'cat /proc/mdstat'
- Re-test the configuration (with monitoring) as noted above
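A sketch of those recovery steps as commands, assuming the replacement drive
is /dev/sdb and the surviving mirror member is /dev/sda (names are
assumptions; adapt them to your layout):

```shell
# Clone the partition table from the good drive to the new one
# (or run fdisk interactively and set the partition type to 'fd').
sfdisk -d /dev/sda | sfdisk /dev/sdb

# Start the resync -- raidtools style:
raidhotadd /dev/md0 /dev/sdb1

# ...or mdadm style:
mdadm --manage /dev/md0 --add /dev/sdb1

# Watch the rebuild progress.
cat /proc/mdstat
```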

Referring back to some performance problems that I had a couple weeks ago, one
handy command to remember is hdparm, which will tell you about DMA and other
settings of your drives.
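For example (the device name /dev/hda is illustrative; newer systems expose
drives as /dev/sdX):

```shell
# Show current drive settings, including whether DMA is enabled.
hdparm /dev/hda

# Benchmark cached and buffered read throughput.
hdparm -tT /dev/hda

# Enable DMA if it is off -- this can dramatically improve performance.
hdparm -d1 /dev/hda
```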

-rich
