linux - How can I verify that a 1TB file transferred correctly?

2014-07-08
  • tbenz9

    I frequently transfer VM images from hypervisors to an archive server for long term storage.

    I transfer using netcat since it is faster than scp, rsync, etc.

    hypervisor$ cat foo.box | nc <archive IP> 1234
    
    archive$ nc -l -p 1234 > foo.box
    

    When the file has finished transferring, I verify there was no corruption by running md5sum on both the target and source.

    Unfortunately, running an md5sum on a large file can take a very long time. How can I compare the integrity of two large files more quickly?

    Update:

    • My transmission rarely gets interrupted, so restartability is not an issue.
    • It generally takes 3-4 hours to transfer via netcat and then another 40 minutes to compute the md5sum.
    • The security of the hash is not an issue in this case.
  • Answers
  • nerdwaller

    You can use tee to do the sum on the fly with something like this (adapt the netcat commands for your needs):

    Server:

    netcat -l -w 2 1111 | tee >( md5sum > /dev/stderr ) > foo.box
    

    Client:

    tee >( md5sum > /dev/stderr ) < foo.box | netcat 127.0.0.1 1111
    
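    Note that >( … ) is bash process substitution, so both commands need to run under bash. Sending the md5sum output to stderr keeps it from getting mixed into the data stream; replace 127.0.0.1 and foo.box with your archive host and file.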
  • Keith Thompson

    The openssl command supports several message digests. Of the ones I was able to try, md4 seems to run in about 65% of the time of md5, and about 54% of the time of sha1 (for the one file I tested with).
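    For example, to time one of the digests on your image (foo.box standing in for your file):

    time openssl dgst -md4 foo.box
    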

    There's also an md2 in the documentation, but it seems to give the same results as md5.

    Very roughly, speed seems to be inversely related to quality, but since you're (probably) not concerned about an adversary creating a deliberate collision, that shouldn't be much of an issue.

    You might look around for older and simpler message digests (was there an md1, for example?).

    A minor point: You've got a Useless Use of cat. Rather than:

    cat foo.box | nc <archive IP> 1234
    

    you can use:

    nc <archive IP> 1234 < foo.box
    

    or even:

    < foo.box nc <archive IP> 1234
    

    Doing so saves a process, but probably won't have any significant effect on performance.

  • spuder

    Two options:

    Use sha1sum

    sha1sum foo.box
    

    In some circumstances sha1sum is faster.


    Use rsync

    It will take longer to transfer, but rsync verifies that the file arrived intact.

    From the rsync man page

    Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred...
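    A minimal invocation for this use case might look like the following (host and path are placeholders):

    rsync -avP foo.box archive:/path/to/storage/
    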

  • Scott

    You probably can’t do any better than a good hash.  You might want to check out other hash/checksum functions to see whether any are significantly faster than md5sum.  Note that you might not need something as strong as MD5.  MD5 (and things like SHA1) are designed to be cryptographically strong, so that it is infeasible for an attacker/imposter to craft a new file that has the same hash value as an existing file (i.e., to make it hard to tamper with signed e-mails and other documents).  If you’re not concerned about an attack on your communications, but only a run-of-the-mill communications error, something like a cyclic redundancy check (CRC) might be good enough.  (But I don’t know whether it would be any faster.)
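    If you want to try a CRC, the POSIX cksum utility is one readily available implementation (whether it is actually faster than md5sum is worth timing on your own hardware):

    cksum foo.box    # prints a CRC checksum and the byte count
    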

    Another approach is to try to do the hash in parallel with the transfer.  This might reduce the overall time, and could definitely reduce the irritation factor of needing to wait for the transfer to finish, and then wait again for the MD5 to finish.  I haven’t tested this, but it should be possible to do something like this:

    • On the source machine:

      mkfifo myfifo
      tee myfifo < source_file | nc dest_host port_number & md5sum myfifo
      
    • On the destination machine:

      mkfifo myfifo
      nc -l -p port_number | tee myfifo > dest_file & md5sum myfifo
      

    Of course checking the sizes of the files is a good, quick way to detect if any bytes got dropped.
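    For example (GNU stat shown; on BSD/macOS it would be stat -f %z instead):

    stat -c %s foo.box    # print the size in bytes; compare on both machines
    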

  • derobert

    Nerdwaller's answer about using tee to simultaneously transfer and calculate a checksum is a good approach if you're primarily worried about corruption over the network. It won't protect you against corruption on the way to disk, etc., though, as it's taking the checksum before the data hits disk.

    But I'd like to add something:

    1 TiB / 40 minutes ≈ 437 MiB/s¹.

    That's pretty fast, actually. Remember that unless you have a lot of RAM, that's got to come back from storage. So the first thing to check is to watch iostat -kx 10 as you run your checksums; in particular you want to pay attention to the %util column. If you're pegging the disks (near 100%), then the answer is to buy faster storage.

    Otherwise, as other posters mentioned, you can try different checksum algorithms. MD4, MD5, and SHA-1 are all designed to be cryptographic hashes (though none of them should be used for that purpose anymore; all are considered too weak). Speed-wise, you can compare them with openssl speed md4 md5 sha1 sha256. I've thrown in SHA256 to have at least one hash that is still considered strong.

    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    md4              61716.74k   195224.79k   455472.73k   695089.49k   820035.58k
    md5              46317.99k   140508.39k   320853.42k   473215.66k   539563.35k
    sha1             43397.21k   126598.91k   283775.15k   392279.04k   473153.54k
    sha256           33677.99k    75638.81k   128904.87k   155874.91k   167774.89k
    

    Of the above, you can see that MD4 is the fastest, and SHA256 the slowest. This result is typical on PC-like hardware, at least.

    If you want even more performance (at the cost of being trivial to tamper with, and also less likely to detect corruption), you want to look at a CRC or Adler hash. Of the two, Adler is typically faster, but weaker. Unfortunately, I'm not aware of any really fast command line implementations; the programs on my system are all slower than OpenSSL's md4.

    So, your best bet speed-wise is openssl md4 -r (the -r makes it look like md5sum output).
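    For example, run the same command on both machines and compare the output:

    openssl md4 -r foo.box
    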

    If you're willing to do some compiling and/or minimal programming, see Mark Adler's code over on Stack Overflow and also xxhash. If you have SSE 4.2, you will not be able to beat the speed of the hardware CRC instruction.
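    For instance, if you build or install xxHash, its command line tool is xxhsum and it is used much like md5sum:

    xxhsum foo.box
    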


    ¹ 1 TiB = 1024⁴ bytes; 1 MiB = 1024² bytes. If the file is instead 1 TB = 1000⁴ bytes, it comes to ≈ 417 MB/s.

  • Gaurav Joseph

    Sending huge files is a pain. Why not chunk up the file, generate a hash for each chunk, send the chunks over to the destination, check the hashes there, and then join the chunks back up? One way to do that is sketched below.
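    A rough sketch of that idea with standard tools (the 1 GiB chunk size and the file names are arbitrary; split -b 1G assumes GNU coreutils):

    # On the source machine: split into chunks and hash each one
    split -b 1G foo.box foo.box.part.
    md5sum foo.box.part.* > foo.box.md5

    # On the destination, after transferring the chunks and foo.box.md5:
    md5sum -c foo.box.md5            # verify each chunk
    cat foo.box.part.* > foo.box     # reassemble
    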

    You could also set up a personal BitTorrent network. BitTorrent hashes every piece as it arrives, so that would ensure that the whole thing reaches safely.


  • Related Question

    unix - Verifying successful file transfer after moving lots of large files
  • amdfan

    I'm working on transferring a lot of files (>100GB, several thousands of files) over my network to a new Mac. Once the transfers are done I'd like to be able to verify that all of the files were successfully transferred and that no corruption occurred during the process.

    The files are coming from a FreeNAS (open source NAS based on FreeBSD) server share. They're being loaded onto the Mac file system.

    So far the best solution I can think of is running ls -aR on the share and on the local disk, saving each listing to a file, and then diffing the two files. Are there better solutions? Even better, is there a way to do this that hashes the files to make sure all the data was successfully transferred?

    In terms of my computer skills, I'm comfortable using terminal applications so there's no need to recommend only GUI tools.


  • Related Answers
  • SleighBoy

    One word. rsync.

  • larsks

    As SleighBoy mentioned, Rsync is the standard tool for this sort of thing.

    Your suggestion for diffing the output of ls on both systems wouldn't work, because while it would verify that the files were present in both locations it would do nothing to verify the integrity of the data. If you were to do this manually, you would need to generate a checksum for each file on both systems and then verify that the checksums match. There are tools out there to do this for you (radmind has tools that make this easy), but it's easier to just use Rsync.
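    For example, after the copy has finished, a dry run with checksums will list any files whose contents differ (paths and host are placeholders; checksumming every file will take a while):

    rsync -rvn --checksum /path/to/source/ user@mac:/path/to/dest/
    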

    This article describes a clever use of BitTorrent to achieve roughly the same thing. I don't recommend this solution, but it's an interesting read.

  • dlamblin

    If you have the files on both sides, you can run md5sum on each end and compare the hash. The way to do this on a bunch of files is to tar them to stdout and pipe the output to md5sum.
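    For example, run the same pipeline on both machines and compare the two hashes (this only works if tar produces byte-identical output on both sides, i.e., the same tar implementation and the same file ordering):

    tar -cf - some_directory | md5sum
    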

    rsync and plain scp -r or rcp -r are also your friends.