forcing a raid recovery

Tue Nov 3 17:24:11 EST 2009

On Tue, 3 Nov 2009, Dan Ritter wrote:

> On Tue, Nov 03, 2009 at 02:12:29PM -0500, Stephen Adler wrote:
>> Hi all,
>>
>> I'm putting together a backup system at my job and in doing so set up the
>> good ol' raid 5 array. While I was putting the disk array together, I
>> read that one could encounter a problem in which you replace a failed
>> drive, the rebuilding process will trip over another bad sector on one
>> of the drives which was good before starting the rebuilding process and
>> thus you end up with a screwed up raid array. So I was thinking of a way
>> to avoid this problem. One solution is to kick off a job once a week or
>> month in which you force the whole raid array to be read. I was thinking
>> of possibly forcing a checksum of all the files I had stored on the
>> disk. The other idea I had was to force one of the drives into a failed
>> state and then add it back in and thus force the raid to rebuild. The
>> rebuilding process takes about 3 hours on my system which I could
>> easily execute at 2am every Sunday morning.
>
> That's why I don't use RAID5, and I do use RAID10, and I also
> have backups.

The OP is using the RAID5 for a backup, not primary storage, so I am 
sympathetic to his desire to use RAID5.

As a backup, the refusal to reconstruct past an error on an unfailed disk 
is not so serious as you might at first glance suppose. You can still read 
all the files in degraded mode; you just can't reconstruct the backup 
disks in place. You could mkfs the backup drives and run the backup again 
(with the bad drives replaced) and get a new backup. There wouldn't 
necessarily be any loss of data or great amount of sys-admin work. It 
would have to be a full backup, not incremental, but that may not be a 
fatal objection.

A double failure is more serious when the array is primary storage. That 
storage then has to be copied in its entirety to alternate primary 
storage, or restored from a backup, which will likely cause hours or days 
of downtime.

Taking a checksum of all the files would be insurance that any bad sectors 
currently in use by files would be noticed, but as another poster pointed 
out, it probably wouldn't help with reconstruction in place.
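
For what it's worth, here is a rough sketch of what such a checksum pass
might look like. The function name checksum_tree and the choice of SHA-256
are my own assumptions for illustration, not anything from the OP's setup.
It reads every regular file under a directory and records any file that
fails to read. Note that it only touches blocks that hold file data, not
parity or free space, and that recently read files may be served from the
page cache rather than from the disks.

```python
import hashlib
import os


def checksum_tree(root, chunk_size=1 << 20):
    """Read every regular file under root.

    Returns (digests, errors): digests maps path -> SHA-256 hex digest,
    errors maps path -> error message for files that could not be read.
    """
    digests = {}
    errors = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, device nodes and the like.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            try:
                with open(path, "rb") as f:
                    # Read in chunks so large backup files fit in memory.
                    for chunk in iter(lambda: f.read(chunk_size), b""):
                        h.update(chunk)
            except OSError as exc:
                # An unreadable sector typically surfaces here as an I/O error.
                errors[path] = str(exc)
                continue
            digests[path] = h.hexdigest()
    return digests, errors
```

Run from cron against the backup mount point, a non-empty errors result
would be the signal to look at the drives before a rebuild is ever needed.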

Daniel Feenberg

>
> The incremental disk and controller cost is paid back in
> man-hours and uptime.
>
> -dsr-