How to determine which drive in a firmware RAID is failing
2014-07
I have two drives in an Intel ICH10 RAID 1. They are not enterprise-level drives; just regular WD Caviar Black drives.
Recently, reading/writing to the mirrored volume has become extremely slow and the HDD light is on constantly. I suspect that this may be due to one of the disks being close to failure and attempting sector remapping. (See also What is the fastest way to force hdd to reallocate bad sectors and discard the data?). If this were an enterprise drive, it would fail quickly and cleanly, but this behavior is typical of consumer drives. Hence, it's not immediately clear which drive is bad.
Neither of the drives shows problematic SMART data (this is from the Intel SSD Toolbox which seems to be one of the few options for reading SMART data off an Intel firmware RAID):
First drive
Second drive
Unfortunately, the WD Data Lifeguard Diagnostic tool which is able to run SMART tests is completely confused by the Intel ICH10 RAID:
How can I tell which drive is the problematic one and swap it out?
From what you describe, the first drive is defective. Read Error Rate and Re-allocated Sector Count are non-zero. Re-allocating sectors is exactly what happens when the drive can not read a sector. It will then re-allocate this sector on the next write operation.
You can do several things to confirm this diagnosis:
Simple but uncertain: use a tool like HDD Scan to scan your disk, i.e., read every sector from your disk. You can also do this operation on your RAID 1 array, but then it is up to the RAID firmware to decide whether it reads the data from disk 1 or disk 2. Therefore this method will not check every sector on both disks. But if disk 1 is about to fail, it is quite probable (but not guaranteed) that its SMART values will worsen.
Keep an eye on Re-allocated Sector Count, Reallocation Event Count and Current Pending Sector Count. If these values go up, your drive is likely to fail soon.
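Watching those three counters can be done by comparing two snapshots of the attribute table. The following is only a minimal sketch: it assumes the attributes are available as text in the ten-column layout that smartctl prints (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value), and the function names parse_raw_values and worsened are made up for illustration. Output copied from another tool would need its own parser.

```python
# Attribute IDs for the three counters named above:
# 5 = reallocated sectors, 196 = reallocation events, 197 = pending sectors.
WATCHED_IDS = {5, 196, 197}


def parse_raw_values(attribute_table):
    """Return {attribute_name: raw_value} for the watched attributes."""
    values = {}
    for line in attribute_table.splitlines():
        fields = line.split()
        # Attribute rows have ten columns and start with a numeric ID;
        # the header row and any other text are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED_IDS and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def worsened(before, after):
    """Return {name: (old, new)} for every watched raw value that went up."""
    return {
        name: (before.get(name, 0), new)
        for name, new in after.items()
        if new > before.get(name, 0)
    }
```

Taking one snapshot before the full-disk scan and one after it, then passing both through worsened, gives an empty dict if nothing changed and the old and new raw values for any counter that grew.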
Complicated but gives more certainty:
- Mount your drives in a different pc/usb-enclosure/different SATA-port.
- Boot from a Live CD (e.g. Ubuntu or Knoppix).
- Perform a read-only test of your drives. You can do this with SMART commands and/or by using tools like dd or badblocks
- do NOT attempt to mount the filesystem
- do NOT write anything to the drive
- as long as you only perform read-only operations, you can re-assemble the RAID afterwards without it being marked as faulty/inconsistent.
- Keep an eye on the same values as mentioned above. Now you should also be able to read the SMART values properly. SMART usually also keeps a log of previous errors. Drive 1 has at least two of them. The timestamp is usually expressed in power-on hours, so you will have to calculate back from the current power-on hours and see whether this correlates with the time you experienced the problems.
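The back-calculation from power-on hours in the last step can be sketched as follows. The function name and the numbers in the usage note are hypothetical. The sketch assumes the drive has been powered continuously since the error was logged; if it was switched off in between, the error happened earlier than the computed time, so the result is best read as the latest possible moment.

```python
from datetime import datetime, timedelta


def latest_error_time(now, current_power_on_hours, error_power_on_hours):
    """Estimate the wall-clock time of a logged SMART error.

    Assumes the drive ran without interruption between the error and now.
    Any powered-off time in between moves the real error time further back.
    """
    hours_ago = current_power_on_hours - error_power_on_hours
    if hours_ago < 0:
        raise ValueError("error timestamp is newer than the current counter")
    return now - timedelta(hours=hours_ago)
```

For example, with made-up values, a drive that reports 20000 power-on hours at noon on 2014-07-01 and an error logged at 19976 hours gives latest_error_time(datetime(2014, 7, 1, 12, 0), 20000, 19976), i.e. noon on the previous day.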
After a reboot, my 1 TB SATA WD1000FYPS (previously it was marked "Drive error") is shown as 0 MB in the 3ware web GUI.
Complete message:
Available Drives (Controller ID 0)
Port 1 WDC WD1000FYPS-01ZKB0 0.00 MB NOT SUPPORTED [Remove Drive]
SMART gives me only the Device Model and ATA protocol version 1 (not 7-8, as it should be for SATA).
What does it mean?
Just before the reboot, when it was marked only with "Device Error", the SMART output was:
Device Model: WDC WD1000FYPS-01ZKB0
Serial Number: WD-WCASJ1130***
Firmware Version: 02.01B01
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Mar 7 18:47:35 2010 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 188 186 021 Pre-fail Always - 7591
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 229
5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 3
7 Seek_Error_Rate 0x000e 193 193 000 Old_age Always - 125
9 Power_On_Hours 0x0032 078 078 000 Old_age Always - 16615
10 Spin_Retry_Count 0x0012 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0012 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 77
192 Power-Off_Retract_Count 0x0032 198 198 000 Old_age Always - 1564
193 Load_Cycle_Count 0x0032 146 146 000 Old_age Always - 164824
194 Temperature_Celsius 0x0022 117 100 000 Old_age Always - 35
196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always - 1
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
What can be wrong with it? Can it be restored?
PS
The new SMART output is:
=== START OF INFORMATION SECTION ===
Device Model: WDC WD1000FYPS-01ZKB0
Serial Number: [No Information Found]
Firmware Version: [No Information Found]
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 1
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Mar 8 00:29:44 2010 MSK
SMART is only available in ATA Version 3 Revision 3 or greater.
We will try to proceed in spite of this.
SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 82-83 don't show if SMART supported.
Checking for SMART support by trying SMART ENABLE command.
Command failed, ata.status=(0x00), ata.command=(0x51), ata.flags=(0x01)
Error SMART Enable failed: Input/output error
SMART ENABLE failed - this establishes that this device lacks SMART functionality.
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
PPS There was a rapid growth of "192 Power-Off_Retract_Count" before the drive died. The drive was used in a RAID together with several drives from the same factory packaging box (close IDs). The hard drives were mounted identically. "Rapid" means almost linear growth from 300 to 1700 in 6-7 hours. The maximum temperature was 41C. (Thanks to munin's SMART monitoring.)
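To put the growth described above into numbers, here is a small sketch of the arithmetic. The helper name growth_per_hour is made up; the counts (300 and 1700) and the 6-7 hour window are the ones quoted above, and linear growth is assumed because that is how it looked in the graphs.

```python
def growth_per_hour(start_count, end_count, hours):
    """Average increase of a SMART raw counter per hour over a time window."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (end_count - start_count) / hours


# Bounds for Power-Off_Retract_Count going from 300 to 1700 in 6-7 hours.
slowest = growth_per_hour(300, 1700, 7)
fastest = growth_per_hour(300, 1700, 6)
print(f"{slowest:.0f} to {fastest:.0f} retracts per hour")
```

That works out to roughly 200 to 233 retracts per hour. For comparison, the dump taken just before the reboot shows a raw value of 1564 after 16615 power-on hours, and by this account about 300 of those had accumulated before the burst started.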
UPDATE
On the hard drive's PCB (on the bottom) I have found contact pads with unusual colors. Most of the unsoldered pads are yellow, but some are blue and some are somewhere between orange and red. The maximum temperature for the drive was 42-43 Celsius. The two drives that were next to the dead one look normal: all their unsoldered pads are yellow.
The hard drive was used for 2 years in a RAID under a rather heavy load.
The drive has failed. RMA it back to WD.