HDD & SSD Linux: Hard resetting link

05
2014-03
  • shanet

    My current storage setup consists of two traditional HDDs and two SSDs in my Linux box, each pair in its own RAID 1 array, which is encrypted via LUKS. I have a story of sorts, rather than a concrete question.

    For over a year now, I've randomly gotten "hard resetting link" errors in the kernel log from some of my drives. I would RMA the problem drive, and the replacement would make the problem stop. A few months later, I would start seeing the same error again at seemingly random times. The drive would be marked as failed in the RAID array and would no longer show up in fdisk -l. I would reboot the computer, the drive would show up again, and I could re-add it to the array and let it rebuild. Sooner or later the problem would happen again, usually a few hours later.

    About six months ago, I replaced two of the traditional HDDs with SSDs in the hope that they wouldn't have nearly as high a failure rate as the traditional drives. However, over the past few days I have started having problems with both one of the new SSDs and one of the traditional drives.

    I'm starting to see a pattern emerge. I get a new drive, and a few months later I start having problems with it. I always assumed it was due to HDDs having a high failure rate, but now it's happening with SSDs, so I'm thinking it isn't the drives' fault. What else could be the problem? I've had multiple OSes installed since I started having the problem, so I want to rule out a software issue. This leaves either the SATA cables or the motherboard. Could the disk encryption be putting too much stress on the drives? Is there anything I can do to get more information? Thanks as always.

    Below is the dmesg output of the problem from a question I asked a few months ago when I was having the same problem.

    [43161.734107] ata3: ATA_REG 0x41 ERR_REG 0x84
    [43161.734110] ata3: tag : dhfis dmafis sdbfis sactive
    [43161.734113] ata3: tag 0x0: 1 1 0 1  
    [43161.734123] ata3.00: exception Emask 0x1 SAct 0x1 SErr 0x180000 action 0x6 frozen
    [43161.734127] ata3.00: Ata error. fis:0x21
    [43161.734130] ata3: SError: { 10B8B Dispar }
    [43161.734134] ata3.00: failed command: READ FPDMA QUEUED
    [43161.734142] ata3.00: cmd 60/08:00:a8:03:00/00:00:00:00:00/40 tag 0 ncq 4096 in
    [43161.734144]          res 41/84:04:a8:03:00/84:00:00:00:00/40 Emask 0x10 (ATA bus error)
    [43161.734148] ata3.00: status: { DRDY ERR }
    [43161.734150] ata3.00: error: { ICRC ABRT }
    [43161.734155] ata3: hard resetting link
    [43161.734158] ata3: nv: skipping hardreset on occupied port
    [43162.220095] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43162.260202] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0                   �'
    [43162.260206] ata3.00: revalidation failed (errno=-19)
    [43162.260211] ata3.00: limiting speed to UDMA/133:PIO2
    [43167.220123] ata3: hard resetting link
    [43167.220127] ata3: nv: skipping hardreset on occupied port
    [43167.710060] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43167.750228] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0                   �'
    [43167.750232] ata3.00: revalidation failed (errno=-19)
    [43167.750236] ata3.00: disabled
    [43172.710100] ata3: hard resetting link
    [43173.620110] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43173.640455] ata3.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
    [43178.620116] ata3: hard resetting link
    [43179.530113] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43179.550748] ata3.00: ATA-8: WDC WD2002FAEX-007BA0, 05.01D05, max UDMA/133
    [43179.550753] ata3.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 31/32)
    [43179.570208] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0                   �'
    [43179.570213] ata3.00: revalidation failed (errno=-19)
    [43179.570220] ata3: limiting SATA link speed to 1.5 Gbps
    [43179.570224] ata3.00: limiting speed to UDMA/133:PIO3
    [43184.530066] ata3: hard resetting link
    [43184.530070] ata3: nv: skipping hardreset on occupied port
    [43185.020091] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43185.060949] ata3.00: configured for UDMA/133
    [43185.060969] sd 2:0:0:0: [sdd]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [43185.060974] sd 2:0:0:0: [sdd]  Sense Key : Aborted Command [current] [descriptor]
    [43185.060980] Descriptor sense data with sense descriptors (in hex):
    [43185.060983]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
    [43185.060995]         00 00 03 a8 
    [43185.061000] sd 2:0:0:0: [sdd]  Add. Sense: Scsi parity error
    [43185.061006] sd 2:0:0:0: [sdd] CDB: Read(10): 28 00 00 00 03 a8 00 00 08 00
    [43185.061017] end_request: I/O error, dev sdd, sector 936
    [43185.061023] Buffer I/O error on device sdd, logical block 117
    [43185.061044] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061048] sd 2:0:0:0: killing request
    [43185.061062] ata3: EH complete
    [43185.061075] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061123] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061134] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061140] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061145] sd 2:0:0:0: [sdd] READ CAPACITY(16) failed
    [43185.061147] sd 2:0:0:0: [sdd]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    [43185.061152] sd 2:0:0:0: [sdd] Sense not available.
    [43185.061155] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061166] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061175] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061185] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061193] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061198] sd 2:0:0:0: [sdd] READ CAPACITY failed
    [43185.061202] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061209] sd 2:0:0:0: [sdd]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    [43185.061215] sd 2:0:0:0: [sdd] Sense not available.
    [43185.061226] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061235] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061245] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061254] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061263] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061274] sd 2:0:0:0: rejecting I/O to offline device
    [43185.061280] sd 2:0:0:0: [sdd] Asking for cache data failed
    [43185.061283] sd 2:0:0:0: [sdd] Assuming drive cache: write through
    [43185.061289] sdd: detected capacity change from 2000398934016 to 0
    [43185.061610] ata3.00: detaching (SCSI 2:0:0:0)
    [43185.062444] sd 2:0:0:0: [sdd] Stopping disk
    [43249.120042] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    [43249.120046] ata4.00: failed command: FLUSH CACHE EXT
    [43249.120051] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
    [43249.120052]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [43249.120054] ata4.00: status: { DRDY }
    [43249.120059] ata4: hard resetting link
    [43249.120060] ata4: nv: skipping hardreset on occupied port
    [43249.610042] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [43249.650323] ata4.00: configured for UDMA/133
    [43249.650326] ata4.00: retrying FLUSH 0xea Emask 0x4
    [43249.650452] ata4.00: device reported invalid CHS sector 0
    [43249.650458] ata4: EH complete
    
  • Answers
  • Everett

    You do have a question here. If I understand correctly, it is: what is the process for determining what is causing this failure?

    I'm a Network Security Engineer. So understand I'm cringing while typing this. Eliminate this as a crypto problem. Decrypt the drives and see if you still have the problem. The downside is you'll need to use them for several months decrypted.

    Cables are a simple test (and you should start there). Swap them out, though I have a hard time believing that's the problem unless you have neon lights inside your case.

    That leaves the mobo. If it's not the other two...

    I'm sure someone will chime in if they disagree with my troubleshooting. It's not costly to change the cables, and disabling encryption temporarily is a security risk that only you can determine if you're willing to accept.
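    If you want a number to watch while you swap cables, one option is the drive's own interface CRC counter. The sketch below is only an illustration: it assumes the drive reports SMART attribute 199 (commonly named UDMA_CRC_Error_Count, but naming and availability are vendor-specific), and the sample text is made-up output in the layout printed by `smartctl -A`, not data from the drives in this thread. The function name is hypothetical.

```python
def udma_crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if it is absent.

    Assumes the attribute table layout printed by `smartctl -A`:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; the raw value starts at column 10.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Illustrative sample only (hypothetical values).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       37
"""

if __name__ == "__main__":
    print("CRC errors so far:", udma_crc_error_count(SAMPLE))
```

    Record the value before swapping a cable and compare it some days later. A counter that keeps climbing points at the link (cable, connector, or controller) rather than at the disk itself.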

  • tenner

    It looks like you have a lot of errors on your SATA link. As a result, the host cannot get commands reliably across the link, and when it does sometimes the data returned is corrupted.

    You see that in messages that the speed is limited, or that the expected drive identifier was not received. You are also seeing confusing messages from different layers of the driver which don't necessarily reflect what is going on at the hardware level of SATA. For example, "limiting speed to UDMA/133:PIO3" strictly applies only to parallel ATA drives (it just means the driver is trying a slower interface speed to see if the errors clear up), but the error messages clearly indicate that the lowest level which actually deals with the hardware understands it's talking to a SATA drive.

    Your thought that it might be the SATA cables is a good one. Try replacing them, and make sure they're rated for SATA 3.0 Gb/sec (also called "SATA 2" or "SATA II"). I don't think your drives are the problem. Why does it take several months for the errors to show up after you replace the drive? Maybe the cables are coming loose somehow and replacing the drive reseats them. Or maybe it's just random chance.
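    To see whether the resets cluster on particular ports, it can help to tally them from the kernel log. This is a rough sketch, assuming log lines shaped like the ones in the question; the function name is made up for illustration, and the sample lines are copied from the dmesg output above.

```python
import re
from collections import Counter

# Matches lines such as "[43161.734155] ata3: hard resetting link".
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")
# Matches the decoded SError line, e.g. "ata3: SError: { 10B8B Dispar }".
SERROR_RE = re.compile(r"\b(ata\d+): SError: \{([^}]*)\}")


def summarize_link_errors(log_text):
    """Count hard link resets per ATA port and collect the SError flags seen."""
    resets = Counter()
    serror_flags = {}
    for line in log_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
            continue
        match = SERROR_RE.search(line)
        if match:
            flags = match.group(2).split()
            serror_flags.setdefault(match.group(1), set()).update(flags)
    return resets, serror_flags


# A few lines taken from the log in the question.
SAMPLE = """\
[43161.734130] ata3: SError: { 10B8B Dispar }
[43161.734155] ata3: hard resetting link
[43167.220123] ata3: hard resetting link
[43249.120059] ata4: hard resetting link
"""

if __name__ == "__main__":
    resets, flags = summarize_link_errors(SAMPLE)
    for port, count in sorted(resets.items()):
        print(port, count, sorted(flags.get(port, ())))
```

    If the same port keeps showing up after you move a drive to a different one, that suggests the port or its cable rather than the drive.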


  • Related Question

    Does LUKS encryption affect TRIM? (SSD and linux)
  • Algific

    I'm moving over to Linux when the new SSD arrives. SSD gives increased performance, so I thought that I could encrypt everything.

    But then I came to think about TRIM, and garbage collection on the drive. Will a LUKS encrypted drive affect the garbage collection system? (TRIM).


  • Related Answers
  • Algific

    I emailed them, and TRIM will not work, because the OS doesn't know where the files are stored. Only the encryption layer knows, since the encryption comes first. I'll use TrueCrypt instead, on top of the file system, for my home folder.

  • Zsub

    No. An empty block will still be listed as empty and thus be TRIMmed.

    Even if your drive is encrypted, the drive itself knows nothing of the encryption, just where which data is (and which space isn't used at the moment). So it'll be fine.

    As for the performance, I don't know what the impact might be. It would seem that certain optimizations in the SSD might not work, but I cannot figure out which ones require knowledge of the actual data, so there will probably be no impact from a storage point of view.
    Note that encryption requires extra CPU cycles, so the impact might be noticeable there.

  • ultrasawblade

    Most of the tutorials I've read about setting up LUKS drives ask you to fill the entire drive with random data first (using badblocks, for example). This way an attacker cannot know which sectors contain data and which ones haven't been used yet. This information could be used to discover things about the data and correlate it with other time-based information, which could lead to a compromise.

    So, even if the LUKS modules supported sending groups of unused blocks to TRIM, you wouldn't want to do it anyway.

  • David Foerster

    From man 5 crypttab:

    Options

    discard

    Allow using of discards (TRIM) requests for device.

    WARNING: Assess the specific security risks carefully before enabling this option. For example, allowing discards on encrypted devices may lead to the leak of information about the ciphertext device (filesystem type, used space etc.) if the discarded blocks can be located easily on the device later.

    Kernel version 3.1 or more recent is required. For older versions is the option ignored.
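    To check which mappings actually have that option set, something like the following could be used. It is a minimal sketch, assuming the usual four-column crypttab layout (name, device, key file, options); the mapping names and devices in the sample are hypothetical.

```python
def crypttab_discard_status(crypttab_text):
    """Map each mapping name in crypttab text to True if 'discard' is set."""
    status = {}
    for raw_line in crypttab_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        name = fields[0]
        # The key file and options columns are optional.
        options = fields[3].split(",") if len(fields) > 3 else []
        status[name] = "discard" in options
    return status


# Hypothetical crypttab content, for illustration only.
SAMPLE = """\
# <name>    <device>  <key file>  <options>
home_crypt  /dev/md1  none        luks,discard
data_crypt  /dev/md0  none        luks
"""

if __name__ == "__main__":
    for name, enabled in crypttab_discard_status(SAMPLE).items():
        print(name, "discard enabled" if enabled else "discard not enabled")
```

    On a real system the text would come from /etc/crypttab. Whether to enable the option at all is the trade-off described in the warning above.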