How to determine which drive in a firmware RAID is failing

2014-07
  • Andrew Mao

    I have two drives in an Intel ICH10 RAID 1. They are not enterprise-level drives; just regular WD Caviar Black drives.

    Recently, reading from and writing to the mirrored volume has become extremely slow, and the HDD light is on constantly. I suspect that one of the disks is close to failure and is attempting sector remapping. (See also "What is the fastest way to force hdd to reallocate bad sectors and discard the data?") If this were an enterprise drive, it would fail quickly and cleanly, but this behavior is typical of consumer drives. Hence, it's not immediately clear which drive is bad.

    Neither of the drives shows problematic SMART data (this is from the Intel SSD Toolbox which seems to be one of the few options for reading SMART data off an Intel firmware RAID):

    First drive

    [Screenshot: SMART attributes of the first drive]

    Second drive

    [Screenshot: SMART attributes of the second drive]

    Unfortunately, the WD Data Lifeguard Diagnostic tool, which is able to run SMART tests, is completely confused by the Intel ICH10 RAID:

    [Screenshot: WD Data Lifeguard Diagnostic output]

    How can I tell which drive is the problematic one and swap it out?

  • Answers
  • masgo

    From the SMART data you posted, the first drive is defective: its Read Error Rate and Re-allocated Sector Count are non-zero. Re-allocating sectors is exactly what happens when the drive cannot read a sector; it will then re-allocate that sector on the next write operation.

    You can do several things to confirm this diagnosis:

    Simple but uncertain: use a tool like HDD Scan to scan your disk, i.e., read every sector of the disk. You can also run this on your RAID 1 array, but then it is up to the RAID firmware to decide whether it reads the data from disk 1 or disk 2, so this method will not check every sector on both disks. Still, if disk 1 is about to fail, it is quite probable (but not guaranteed) that its SMART values will worsen.
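
    As a rough illustration of what "read every sector" means, here is a minimal Python sketch of a read-only scan. It is not how HDD Scan works internally; it simply opens a path read-only, reads it from start to end, and records the offsets where the operating system reports a read error. The function name `scan_read_only` and the 1 MiB chunk size are assumptions chosen for this example, and reading a raw device (for example a hypothetical `/dev/sdX` on Linux) normally requires administrator rights.

```python
import os

def scan_read_only(path, chunk_size=1024 * 1024):
    """Read `path` from start to end without writing anything.

    Returns (bytes_covered, bad_offsets), where bad_offsets lists the
    start offsets of chunks that raised a read error.
    """
    bad_offsets = []
    offset = 0
    # O_RDONLY guarantees this function cannot write to the target.
    flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                data = os.read(fd, want)
                if not data:
                    break  # unexpected end of data
                offset += len(data)
            except OSError:
                # Unreadable region: remember it and skip past the chunk.
                bad_offsets.append(offset)
                offset += want
    finally:
        os.close(fd)
    return offset, bad_offsets
```

    A slow or failing chunk shows up here only as a long pause or an entry in `bad_offsets`; the SMART counters still need to be checked afterwards to see whether the drive remapped anything.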

    Keep an eye on Re-allocated Sector Count, Reallocation Event Count and Current Pending Sector Count. If these values go up, your drive is likely to fail soon.
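
    If you can get the attributes as text, comparing two snapshots is easy to script. The sketch below assumes the column layout printed by `smartctl -A` (the same layout as the attribute table quoted further down this page); the Intel tool's display would need a different parser. The helper names `parse_raw_values` and `increased` are made up for this example, and the "after" snapshot in the usage lines is hypothetical.

```python
# Attribute IDs worth watching: reallocated sectors, reallocation
# events, and sectors currently pending reallocation.
WATCHED_IDS = {5, 196, 197}

def parse_raw_values(smartctl_output):
    """Extract {attribute_id: raw_value} for the watched attributes."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED_IDS:
                try:
                    values[attr_id] = int(fields[9])
                except ValueError:
                    continue  # raw value not a plain integer; skip it
    return values

def increased(before, after):
    """Return {id: (old, new)} for watched attributes that went up."""
    return {
        attr_id: (before[attr_id], after[attr_id])
        for attr_id in before
        if attr_id in after and after[attr_id] > before[attr_id]
    }

# Rows copied from the smartctl table quoted on this page.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       3
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
"""

before = parse_raw_values(SAMPLE)
after = {5: 5, 196: 2, 197: 0}  # hypothetical later snapshot
print(increased(before, after))
```

    Any non-empty result between two snapshots taken before and after a full read scan points at the drive that is remapping sectors.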

    Complicated but gives more certainty:

    1. Mount your drives in a different PC, in a USB enclosure, or on a different SATA port.
    2. Boot from a Live CD (e.g. Ubuntu or Knoppix).
    3. Perform a read-only test of your drives. You can do this with SMART commands and/or with tools like dd or badblocks.
      • do NOT attempt to mount the filesystem
      • do NOT write anything to the drive
      • as long as you perform only read-only operations, you can re-assemble the RAID afterwards without it being marked as faulty/inconsistent.
    4. Keep an eye on the same values as mentioned above. Now you should also be able to read the SMART values properly. SMART usually also keeps a log of previous errors. Drive 1 has at least two of them. The timestamp is usually expressed in power-on hours, so you will have to calculate back from the current power-on hours and check whether this correlates with the time you experienced the problems.
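
    The calculation in step 4 can be sketched in a few lines of Python. The function name `estimate_error_time` and all numbers in the usage lines are hypothetical. Note the built-in assumption: the drive is treated as having been powered on continuously since the error, so if the machine was switched off in between, the real event happened earlier than the estimate.

```python
from datetime import datetime, timedelta

def estimate_error_time(now, current_power_on_hours, error_power_on_hours):
    """Estimate when a logged error happened.

    Assumes the drive has been powered on continuously since the error;
    any powered-off time means the real event was earlier.
    """
    hours_ago = current_power_on_hours - error_power_on_hours
    if hours_ago < 0:
        raise ValueError("error timestamp is later than current power-on hours")
    return now - timedelta(hours=hours_ago)

# Hypothetical values: the drive reports 10000 power-on hours now,
# and the error log entry is stamped at 9976 power-on hours.
now = datetime(2014, 7, 1, 12, 0)
print(estimate_error_time(now, 10000, 9976))
```

    If the estimated time lines up with when the slowdowns started, that is good evidence that the logged errors and the symptoms belong together.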

  • Related Question

    hard drive - WD1000FYPS harddrive is marked 0 mb in 3ware (and no SMART)
  • osgx

    After a reboot, my 1 TB SATA WD1000FYPS (previously it was marked "Drive error") is shown as 0 MB in the 3ware web GUI.

    Complete message:

    Available Drives (Controller ID 0)
    Port 1  WDC WD1000FYPS-01ZKB0   0.00 MB NOT SUPPORTED   [Remove Drive]
    

    SMART now reports only the Device Model and ATA protocol version 1 (not 7 or 8, as it should be for SATA).

    What does this mean?

    Just before the reboot, when it was marked only with "Device Error", the SMART output was:

    Device Model:     WDC WD1000FYPS-01ZKB0
    Serial Number:    WD-WCASJ1130***
    Firmware Version: 02.01B01
    User Capacity:    1,000,204,886,016 bytes
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  Exact ATA specification draft version not indicated
    Local Time is:    Sun Mar  7 18:47:35 2010 MSK
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    SMART overall-health self-assessment test result: PASSED
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0003   188   186   021    Pre-fail  Always       -       7591
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       229
      5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       3
      7 Seek_Error_Rate         0x000e   193   193   000    Old_age   Always       -       125
      9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       16615
     10 Spin_Retry_Count        0x0012   100   100   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0012   100   253   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       77
    192 Power-Off_Retract_Count 0x0032   198   198   000    Old_age   Always       -       1564
    193 Load_Cycle_Count        0x0032   146   146   000    Old_age   Always       -       164824
    194 Temperature_Celsius     0x0022   117   100   000    Old_age   Always       -       35
    196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
    197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
    

    What could be wrong with it? Can it be restored?

    PS

    The new SMART output is:

    === START OF INFORMATION SECTION ===
    Device Model:     WDC WD1000FYPS-01ZKB0
    Serial Number:    [No Information Found]
    Firmware Version: [No Information Found]
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   1
    ATA Standard is:  Exact ATA specification draft version not indicated
    Local Time is:    Mon Mar  8 00:29:44 2010 MSK
    SMART is only available in ATA Version 3 Revision 3 or greater.
    We will try to proceed in spite of this.
    SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 82-83 don't show if SMART supported.
                      Checking for SMART support by trying SMART ENABLE command.
    Command failed, ata.status=(0x00), ata.command=(0x51), ata.flags=(0x01)
    Error SMART Enable failed: Input/output error
                      SMART ENABLE failed - this establishes that this device lacks SMART functionality.
    A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
    

    PPS: There was rapid growth of "192 Power-Off_Retract_Count" before the drive died. The drive was used in a RAID with several drives from the same factory packaging box (close IDs). The hard drives were mounted identically. "Rapid" means almost linear growth from 300 to 1700 in 6-7 hours. The maximum temperature was 41 °C (thanks to munin's SMART monitoring).
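
    To put a number on that growth, a small Python sketch (the function name `growth_per_hour` is made up for this example; the counts and durations are the ones quoted above):

```python
def growth_per_hour(start_count, end_count, hours):
    """Average increase of a SMART raw counter per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (end_count - start_count) / hours

# 300 -> 1700 over 6 to 7 hours, as observed in the munin graphs.
slow = growth_per_hour(300, 1700, 7)
fast = growth_per_hour(300, 1700, 6)
print(f"{slow:.0f} to {fast:.0f} retracts per hour")
```

    That works out to roughly 200 to 233 power-off retracts per hour, i.e. one every 15 to 18 seconds.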

    UPDATE

    On the hard drive's PCB (on the bottom) I found contact pads with unusual colors. Most of the unsoldered pads are yellow, but some are blue and some are somewhere between orange and red. The maximum temperature of the drive was 42-43 °C. The 2 drives that were next to the dead one look normal: all of their unsoldered pads are yellow.

    The hard drive was used for 2 years in a RAID under rather heavy load.


  • Related Answers
  • Alexander Burke

    The drive has failed. RMA it back to WD.