drive dropped from RAID5 set after reboot
Tom Metro
blu at vl.com
Thu Mar 15 20:05:55 EDT 2007
On an Ubuntu Feisty system, I received notice of a degraded RAID array
after rebooting today. Investigating showed:
# mdadm --detail /dev/md1
/dev/md1:
Version : 00.90.03
Creation Time : Fri Jan 26 16:20:26 2007
Raid Level : raid5
...
Raid Devices : 4
Total Devices : 3
Preferred Minor : 1
Persistence : Superblock is persistent
...
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
...
    Number   Major   Minor   RaidDevice State
       0     254        4        0      active sync   /dev/mapper/sda1
       1       0        0        1      removed
       2     254        5        2      active sync   /dev/mapper/sdc1
       3     254        6        3      active sync   /dev/mapper/sdd1
If it were a hardware problem, or otherwise a problem with the physical
drive, I'd expect it to show up as "failed" rather than "removed."
No complaints when the device was re-added:
# mdadm -v /dev/md1 --add /dev/mapper/sdb1
mdadm: added /dev/mapper/sdb1
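The resync progress can be watched while it rebuilds, either in /proc/mdstat
or in the mdadm output (both are the standard md status interfaces; nothing
specific to this setup):

# cat /proc/mdstat
# mdadm --detail /dev/md1 | grep -i rebuild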
but it troubles me that it just disappeared on its own. dmesg doesn't show
anything interesting, other than sdb1 never being picked up by md:
# dmesg | fgrep sd
...
[ 35.520480] sdb: Write Protect is off
[ 35.520483] sdb: Mode Sense: 00 3a 00 00
[ 35.520496] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 35.520542] SCSI device sdb: 625142448 512-byte hdwr sectors (320073 MB)
[ 35.520550] sdb: Write Protect is off
[ 35.520552] sdb: Mode Sense: 00 3a 00 00
[ 35.520564] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 35.520567] sdb: sdb1
[ 35.538213] sd 1:0:0:0: Attached scsi disk sdb
...
[ 35.939614] md: bind<sdc1>
[ 35.939797] md: bind<sdd1>
[ 35.939942] md: bind<sda1>
[ 49.731674] md: unbind<sda1>
[ 49.731684] md: export_rdev(sda1)
[ 49.731707] md: unbind<sdd1>
[ 49.731711] md: export_rdev(sdd1)
[ 49.731722] md: unbind<sdc1>
[ 49.731726] md: export_rdev(sdc1)
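Next time this happens, comparing the RAID superblocks on the member devices
before re-adding might shed some light; if sdb1's event counter had fallen
behind the others, that would explain md kicking it out at assembly time.
Something along these lines (the egrep pattern is just my guess at the
interesting fields):

# mdadm --examine /dev/mapper/sdb1 | egrep 'Events|State|UUID'
# mdadm --examine /dev/mapper/sda1 | egrep 'Events|State|UUID'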
Other than the DegradedArray event, /var/log/daemon.log doesn't show
anything interesting. smartd didn't report any problems with /dev/sdb.
Then again, while looking into this I found:
smartd[6370]: Device: /dev/hda, opened
smartd[6370]: Device: /dev/hda, found in smartd database.
smartd[6370]: Device: /dev/hda, is SMART capable. Adding to "monitor" list.
...
smartd[6370]: Device: /dev/sda, opened
smartd[6370]: Device: /dev/sda, IE (SMART) not enabled, skip device Try
'smartctl -s on /dev/sda' to turn on SMART features
...
smartd[6370]: Device: /dev/sdb, IE (SMART) not enabled...
smartd[6370]: Device: /dev/sdc, IE (SMART) not enabled...
smartd[6370]: Device: /dev/sdd, IE (SMART) not enabled...
smartd[6370]: Monitoring 1 ATA and 0 SCSI devices
So it looks like the drives in the RAID array weren't being monitored by
smartd. Running the suggested command:
# smartctl -s on /dev/sda
smartctl version 5.36 ...
unable to fetch IEC (SMART) mode page [unsupported field in scsi command]
A mandatory SMART command failed: exiting. To continue, add one or more
'-T permissive' options.
Seems it doesn't like these SATA drives. I'll have to investigate further...
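If I'm reading the smartmontools docs right, SATA drives behind libata show
up as SCSI devices, and smartctl/smartd have to be told to use the ATA
pass-through with '-d ata' (untested on this box, so consider it a guess):

# smartctl -d ata -s on /dev/sda
# smartctl -d ata -a /dev/sda

and correspondingly in /etc/smartd.conf, something like:

/dev/sda -d ata -a
/dev/sdb -d ata -a
/dev/sdc -d ata -a
/dev/sdd -d ata -a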
I've noticed that the device names changed as of a reboot last weekend,
probably due to upgrades to the udev system. The array was originally set up
with /dev/sda1 ... /dev/sdd1, and the output from /proc/mdstat prior to last
week's reboot showed:
md1 : active raid5 sda1[0] sdd1[3] sdc1[2] sdb1[1]
and now shows:
md1 : active raid5 dm-7[4] dm-6[3] dm-5[2] dm-4[0]
but if that were the source of the problem, I'd expect it to throw off all of
the devices, not just one of the drives.
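To double-check what those dm-* names actually map to, device-mapper can be
queried directly; the minor numbers should line up with the 254:N entries
mdadm reports:

# ls -l /dev/mapper/
# dmsetup ls
# dmsetup table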
It may be relevant to note that the array was initially created in a degraded
state (a 4-device array with only 3 devices active), with the 4th device
added just prior to the previous reboot. But the device added then was
/dev/sda1, not /dev/sdb1.
I've also noticed during the last couple of reboots a console message saying
something like "no RAID arrays found in mdadm.conf", but since that file has
been updated to reflect the current output of "mdadm --detail --scan" and the
array has been functioning, I've ignored it. However, while investigating the
above I noticed:
mythtv:/etc# dmesg | fgrep md:
[ 31.069854] md: raid1 personality registered for level 1
[ 31.651721] md: raid6 personality registered for level 6
[ 31.651723] md: raid5 personality registered for level 5
[ 31.651724] md: raid4 personality registered for level 4
[ 35.710310] md: md0 stopped.
[ 35.793291] md: md1 stopped.
[ 35.939614] md: bind<sdc1>
[ 35.939797] md: bind<sdd1>
[ 35.939942] md: bind<sda1>
[ 36.251952] md: array md1 already has disks!
[...80 more identical messages deleted...]
[ 49.476995] md: array md1 already has disks!
[ 49.731660] md: md1 stopped.
[ 49.731674] md: unbind<sda1>
[ 49.731684] md: export_rdev(sda1)
[ 49.731707] md: unbind<sdd1>
[ 49.731711] md: export_rdev(sdd1)
[ 49.731722] md: unbind<sdc1>
[ 49.731726] md: export_rdev(sdc1)
[ 51.613310] md: bind<dm-4>
[ 51.618923] md: bind<dm-5>
[ 51.632529] md: bind<dm-6>
[ 51.714527] md: couldn't update array info. -22
[ 51.714580] md: couldn't update array info. -22
The "array md1 already has disks" messages as well as the repeated
starting/stopping and binding and unbinding seems to suggest that
something isn't quite right.
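A quick way to confirm whether the initramfs is doing its own assembly pass
would be to look for mdadm and its config inside the initramfs image,
something like this on a gzip'd cpio initramfs:

# zcat /boot/initrd.img-$(uname -r) | cpio -it | grep mdadm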
Then again, maybe some of this is by design. I see in /etc/default/mdadm:
# list of arrays (or 'all') to start automatically when the initial ramdisk
# loads. This list *must* include the array holding your root filesystem. Use
# 'none' to prevent any array from being started from the initial ramdisk.
INITRDSTART='all'
so maybe the array is initially being set up by the initrd, and then set up
again at a later stage. This system doesn't have its root file system on the
array, so I'm going to switch 'all' to 'none'.
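My understanding is that the initramfs carries its own copy of the mdadm
configuration, so the change (and any update to the ARRAY lines in
/etc/mdadm/mdadm.conf) probably won't take effect until the initramfs is
regenerated; on Debian/Ubuntu that should be either:

# dpkg-reconfigure mdadm

or

# update-initramfs -u

with the ARRAY line double-checked against the current scan output first:

# mdadm --detail --scan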
That still leaves me without a likely cause for why the drive
disappeared from the array.
-Tom
--
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/