hardware failure - Hard disk very slow, failing with more and more errors

05
2014-04
  • user283120

    For a couple of days now, my Seagate Momentus 7200.4 has been failing more and more, possibly because of a power outage. After the "WARNING: Your hard drive is failing" notification (I'm using Fedora), the main symptom was slowness: constant 100% CPU wait for hours, making it almost impossible to do anything. I made a backup, then restarted and had to run e2fsck -y (lots of output), which I had to repeat later (at one point the machine didn't even boot: kernel panic). I ran some short and long smartctl tests and left the machine alone for a night to do its sector correcting or whatever.

    Now errors seem to be accumulating more slowly and the computer is mostly usable, but what should I do? Is there some fsck option that works better, or some other way to make the disk skip the bad sectors and keep functioning, other than fixing the sectors one by one with hdparm? Or is the drive certain to end up in the trash?

    Excerpts from smartctl -x /dev/sda :

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
      1 Raw_Read_Error_Rate     POSR--   085   074   006    -    243348742
      5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
      7 Seek_Error_Rate         POSR--   084   060   030    -    238612361
      9 Power_On_Hours          -O--CK   087   087   000    -    11535
    198 Offline_Uncorrectable   ----C-   100   100   000    -    8
    199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
    240 Head_Flying_Hours       ------   100   253   000    -    132680129719553
    241 Total_LBAs_Written      ------   100   253   000    -    2525013242
    242 Total_LBAs_Read         ------   100   253   000    -    2162196433
    
    Error 3759 [18] occurred at disk power-on lifetime: 11535 hours (480 days + 15 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER -- ST COUNT  LBA_48  LH LM LL DV DC
      -- -- -- == -- == == == -- -- -- -- --
      40 -- 51 00 00 00 22 7e 00 3d 2a 00 00  Error: UNC at LBA = 0x227e003d2a = 148142832938
    
      Commands leading to the command that caused the error were:
      CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
      -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
      60 00 00 00 08 00 22 7e 00 3d 28 40 00     18:38:24.892  READ FPDMA QUEUED
      27 00 00 00 00 00 00 00 00 00 00 e0 00     18:38:24.891  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 00 00 00 00 00 a0 00     18:38:24.889  IDENTIFY DEVICE
      ef 00 03 00 46 00 00 00 00 00 00 a0 00     18:38:24.889  SET FEATURES [Set transfer mode]
      27 00 00 00 00 00 00 00 00 00 00 e0 00     18:38:24.889  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
    
    
    SMART Extended Self-test Log Version: 1 (1 sectors)
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed: read failure       90%     11528         574443398
    

    More: http://p.defau.lt/?DTSGCmr7mb_anDD3IQ9Bgg http://p.defau.lt/?hNM7_BusGyz4DYLi9XX0Kg http://p.defau.lt/?wQArANAXPLnpyD87xUY6CA http://p.defau.lt/?hXbtLh27yFZhySu0y9axJw

    Update: since you said the disk is headed for the trash anyway, I ran dmesg | grep -oE "sector.+$" | sort -u and overwrote a dozen sectors with sudo hdparm --write-sector --yes-i-know-what-i-am-doing. I'm now running another test; let's see what comes out of it.
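
The dmesg pipeline in this update can be approximated in Python when the log text is already at hand. This is only a minimal sketch, not the poster's method: the function name `failing_sectors` and the sample log lines are made up for illustration, and it assumes the kernel messages contain the word "sector" followed by a decimal number. Unlike `sort -u`, it sorts numerically, and it only lists sectors; it does not touch the disk.

```python
import re

# Assumption: kernel I/O error lines mention "sector <decimal number>".
SECTOR_RE = re.compile(r"\bsector (\d+)")

def failing_sectors(log_text):
    """Return the sorted, de-duplicated sector numbers found in kernel log text."""
    return sorted({int(m.group(1)) for m in SECTOR_RE.finditer(log_text)})

# Hypothetical sample lines in the style of dmesg output (not taken from the post).
sample_log = """\
[ 1234.5] end_request: I/O error, dev sda, sector 574443398
[ 1240.1] end_request: I/O error, dev sda, sector 574443398
[ 1302.7] end_request: I/O error, dev sda, sector 574443400
"""

print(failing_sectors(sample_log))
```

Each number in the resulting list is a candidate to inspect before deciding whether to overwrite it.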

    Update 2: I had to fix some more bad sectors manually with hdparm but, a night later, all the errors I find in the system log seem to have been auto-corrected, as they normally should be. I ran into some funny errors in the meantime, like distorted sound à la techno music and grep freaking out, but a yum update may have sufficed to repair them. The last smartctl -a /dev/sda completed without errors; I now have "ATA Error Count: 5004" and a raw value of 2 for both 197 Current_Pending_Sector and 198 Offline_Uncorrectable.

    Update 3: the system is mostly usable, but the problems persist: "ATA Error Count: 9484". I sometimes have to use the hdparm trick, but I think it's not working properly because the problem later appears on the following sector. Offline_Uncorrectable is not growing, so I suspect the disk is failing to deactivate bad sectors. I guess I have to give up and buy a new one...

  • Answers
  • Julian Knight

    Hopefully all of your data is backed up?

    If not, get a new disk ASAP, one at least as large as the old one, and start a local backup.

    In my experience it is much easier to replace the disk sooner rather than later.

    However, if you have the cash, you might want to invest in a copy of SpinRite. Get that running on the disk; it may take days or even weeks in extreme cases. It can't always recover the disk, but it does so surprisingly often. Indeed, it regularly brings disks back from the brink: I've had it resurrect a couple of laptops already. In one case, it recovered the disk to the point where it is still in use over 12 months later. In the other, it recovered the majority of the data, enough to allow a more leisurely rebuild. It costs around USD 90, though, so it is not cheap. If the errors were caused by a power blip from your machine, SpinRite will probably fix things up fine. If not, it will show you how bad things are and may buy you enough time to copy to another disk.

    By the way, bad sectors should be marked automatically by the disk's firmware; you shouldn't be messing with them. Interestingly, the exercise that SpinRite puts a disk through will quite often clear bad-sector marks, since sectors may have been marked because of inconsistent head movement rather than actual disk failure.

    Also, as a number of researchers have discovered, SMART warnings are pretty useless on their own, since they are not a good predictor of disk failure. Google did a large study on the matter.


  • Related Question

    smart - Is my hard drive failing?
  • conor

    I'm beginning to worry about my roughly 3-year-old WD Green drive. In the last few days I've noticed that my media player is acting weird: it won't move to the next track after a song has finished, and it won't play new songs when I double-click them.

    So, I downloaded the "smartmontools" package and used "sudo smartctl -a /dev/sdb2" to check out the drive. Here is a snapshot of the output:

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0027   164   163   021    Pre-fail  Always       -       6758
      4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1353
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       5846
     10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1201
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       61
    193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1353
    194 Temperature_Celsius     0x0022   124   112   000    Old_age   Always       -       26
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   198   000    Old_age   Always       -       3311
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
    

    I'm most worried about the "Pre-fail" rows. Does this mean that the drive could fail at any time, or what?


  • Related Answers
  • MDMarra

    If you read the column headers, you'd see that Pre-fail is the type of statistic that's collected, not the status. The WHEN_FAILED column being empty should also give you a hint about whether or not anything has failed (nothing has).

    When in doubt, read the manpage or look for documentation on your problem.

  • RobinJ

    Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under the heading "THRESH". If the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed. If the Attribute is a pre-failure Attribute, then disk failure is imminent.

    So as long as the normalized value is higher than the threshold value, there's nothing to worry about.
    Source: http://smartmontools.sourceforge.net/man/smartctl.8.html
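
The rule quoted above (an attribute has failed when its normalized VALUE is less than or equal to THRESH) can be checked mechanically. The following is a minimal sketch that assumes the ten-column attribute table layout shown in the question; the function name `failed_attributes` is made up here, and the last sample row is a hypothetical failing attribute added only to show what a hit looks like.

```python
def failed_attributes(table_text):
    """Return (name, type) for every attribute whose VALUE <= THRESH."""
    failed = []
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, attr_type = fields[1], fields[6]
        value, thresh = int(fields[3]), int(fields[5])
        if value <= thresh:
            failed.append((name, attr_type))
    return failed

# Rows copied from the question's output: none of them has failed.
healthy = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   198   000    Old_age   Always       -       3311
"""

# Hypothetical row (not from the question): VALUE 130 is at or below THRESH 140.
failing = "  5 Reallocated_Sector_Ct   0x0033   130   130   140    Pre-fail  Always   FAILING_NOW 1200\n"

print(failed_attributes(healthy))
print(failed_attributes(healthy + failing))
```

An empty list for the question's rows matches the answers here: the Pre-fail label describes the attribute type, and none of the values has reached its threshold.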

  • Jens Erat

    Your hard disk seems fine: no reallocated sectors and no failed attributes. To be sure, run an fsck and a hard disk self-test with smartctl -t long /dev/sdb; this one will take some hours. You will be able to read the results using the same command you used above.

    SMART is per disk, not per volume, so pass the drive, not the volume (though smartctl seems to be clever enough to find your disk anyhow).
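
To illustrate the drive-versus-volume point, here is a rough sketch that maps a partition path to its parent disk before building a smartctl command line. The helper name `whole_disk` is made up, and the mapping relies only on common Linux device naming conventions (sdXN, nvmeXnYpZ, mmcblkXpY), which is an assumption rather than anything smartctl itself guarantees.

```python
import re

def whole_disk(device):
    """Map a partition path such as /dev/sdb2 to its parent disk /dev/sdb."""
    # NVMe and MMC names put a 'p' before the partition number (nvme0n1p2, mmcblk0p1).
    m = re.fullmatch(r"(/dev/(?:nvme\d+n\d+|mmcblk\d+))p\d+", device)
    if m:
        return m.group(1)
    # Classic names: sdb2 -> sdb, hda1 -> hda, vda1 -> vda, xvda1 -> xvda.
    m = re.fullmatch(r"(/dev/(?:sd|hd|vd|xvd)[a-z]+)\d+", device)
    if m:
        return m.group(1)
    # Anything unrecognised (including whole-disk paths) is returned unchanged.
    return device

# Build, but do not run, the self-test command for the disk behind /dev/sdb2.
print("smartctl -t long", whole_disk("/dev/sdb2"))
```

Paths that already name a whole disk pass through untouched, so the helper is safe to apply to either form.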