linux - How can I verify that a 1TB file transferred correctly?

2014-07-08
  • tbenz9

    I frequently transfer VM images from hypervisors to an archive server for long term storage.

    I transfer using netcat since it is faster than scp, rsync, etc.

    hypervisor$ cat foo.box | nc <archive IP> 1234
    
    archive$ nc -l -p 1234 > foo.box
    

    When the file has finished transferring, I verify there was no corruption by running md5sum on both the target and source.

    Unfortunately, running an md5sum on a large file can take a very long time. How can I compare the integrity of two large files more quickly?

    Update:

    • My transmission rarely gets interrupted, so restartability is not an issue.
    • It generally takes 3-4 hours to transfer via netcat and then another 40 minutes to compute the md5sum.
    • The security of the hash is not an issue in this case.
  • Answers
  • nerdwaller

    You can use tee to do the sum on the fly with something like this (adapt the netcat commands for your needs):

    Server:

    netcat -l -w 2 1111 | tee >( md5sum > /dev/stderr ) > foo.box
    

    Client:

    tee >( md5sum > /dev/stderr ) < foo.box | netcat 127.0.0.1 1111
    
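    Note that >( … ) is bash process substitution, so both commands need to run under bash. Sending the md5sum output to stderr keeps it from getting mixed into the data stream; replace 127.0.0.1 and foo.box with your archive host and file.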
  • Keith Thompson

    The openssl command supports several message digests. Of the ones I was able to try, md4 seems to run in about 65% of the time of md5, and about 54% of the time of sha1 (for the one file I tested with).
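    For example, to time one of the digests on your image (foo.box standing in for your file):

    time openssl dgst -md4 foo.box
    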

    There's also an md2 in the documentation, but it seems to give the same results as md5.

    Very roughly, speed seems to be inversely related to quality, but since you're (probably) not concerned about an adversary creating a deliberate collision, that shouldn't be much of an issue.

    You might look around for older and simpler message digests (was there an md1, for example?).

    A minor point: You've got a Useless Use of cat. Rather than:

    cat foo.box | nc <archive IP> 1234
    

    you can use:

    nc <archive IP> 1234 < foo.box
    

    or even:

    < foo.box nc <archive IP> 1234
    

    Doing so saves a process, but probably won't have any significant effect on performance.

  • spuder

    Two options:

    Use sha1sum

    sha1sum foo.box
    

    In some circumstances sha1sum is faster.


    Use rsync

    It will take longer to transfer, but rsync verifies that the file arrived intact.

    From the rsync man page

    Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred...
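    A minimal invocation for this use case might look like the following (host and path are placeholders):

    rsync -avP foo.box archive:/path/to/storage/
    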

  • Scott

    You probably can’t do any better than a good hash.  You might want to check out other hash/checksum functions to see whether any are significantly faster than md5sum.  Note that you might not need something as strong as MD5.  MD5 (and things like SHA1) are designed to be cryptographically strong, so that it is infeasible for an attacker/imposter to craft a new file that has the same hash value as an existing file (i.e., to make it hard to tamper with signed e-mails and other documents).  If you’re not concerned about an attack on your communications, but only a run-of-the-mill communications error, something like a cyclic redundancy check (CRC) might be good enough.  (But I don’t know whether it would be any faster.)
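    If you want to try a CRC, the POSIX cksum utility is one readily available implementation (whether it is actually faster than md5sum is worth timing on your own hardware):

    cksum foo.box    # prints a CRC checksum and the byte count
    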

    Another approach is to try to do the hash in parallel with the transfer.  This might reduce the overall time, and could definitely reduce the irritation factor of needing to wait for the transfer to finish, and then wait again for the MD5 to finish.  I haven’t tested this, but it should be possible to do something like this:

    • On the source machine:

      mkfifo myfifo
      tee myfifo < source_file | nc dest_host port_number & md5sum myfifo
      
    • On the destination machine:

      mkfifo myfifo
      nc -l -p port_number | tee myfifo > dest_file & md5sum myfifo
      

    Of course checking the sizes of the files is a good, quick way to detect if any bytes got dropped.
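    For example (GNU stat shown; on BSD/macOS it would be stat -f %z instead):

    stat -c %s foo.box    # print the size in bytes; compare on both machines
    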

  • derobert

    Nerdwaller's answer about using tee to simultaneously transfer and calculate a checksum is a good approach if you're primarily worried about corruption over the network. It won't protect you against corruption on the way to disk, etc., though, as it's taking the checksum before the data hits disk.

    But I'd like to add something:

    1 TiB / 40 minutes ≈ 437 MiB/s¹.

    That's pretty fast, actually. Remember that unless you have a lot of RAM, that's got to come back from storage. So the first thing to check is to watch iostat -kx 10 as you run your checksums; in particular you want to pay attention to the %util column. If you're pegging the disks (near 100%), then the answer is to buy faster storage.

    Otherwise, as other posters mentioned, you can try different checksum algorithms. MD4, MD5, and SHA-1 are all designed to be cryptographic hashes (though none of them should be used for that purpose anymore; all are considered too weak). Speed-wise, you can compare them with openssl speed md4 md5 sha1 sha256. I've thrown in SHA256 to have at least one hash that is still considered strong.

    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    md4              61716.74k   195224.79k   455472.73k   695089.49k   820035.58k
    md5              46317.99k   140508.39k   320853.42k   473215.66k   539563.35k
    sha1             43397.21k   126598.91k   283775.15k   392279.04k   473153.54k
    sha256           33677.99k    75638.81k   128904.87k   155874.91k   167774.89k
    

    Of the above, you can see that MD4 is the fastest, and SHA256 the slowest. This result is typical on PC-like hardware, at least.

    If you want even more performance (at the cost of being trivial to tamper with, and also less likely to detect corruption), you want to look at a CRC or Adler hash. Of the two, Adler is typically faster, but weaker. Unfortunately, I'm not aware of any really fast command line implementations; the programs on my system are all slower than OpenSSL's md4.

    So, your best bet speed-wise is openssl md4 -r (the -r makes it look like md5sum output).
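    For example, run the same command on both machines and compare the output:

    openssl md4 -r foo.box
    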

    If you're willing to do some compiling and/or minimal programming, see Mark Adler's code over on Stack Overflow and also xxhash. If you have SSE 4.2, you will not be able to beat the speed of the hardware CRC instruction.
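    For instance, if you build or install xxHash, its command line tool is xxhsum and it is used much like md5sum:

    xxhsum foo.box
    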


    ¹ 1 TiB = 1024⁴ bytes; 1 MiB = 1024² bytes. If the file is instead 1 TB = 1000⁴ bytes, it comes to ≈ 417 MB/s.

  • Gaurav Joseph

    Sending huge files is a pain. Why not chunk up the file, generate a hash for each chunk, send the chunks over to the destination, check the hashes there, and then join the chunks back up? One way to do that is sketched below.
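    A rough sketch of that idea with standard tools (the 1 GiB chunk size and the file names are arbitrary; split -b 1G assumes GNU coreutils):

    # On the source machine: split into chunks and hash each one
    split -b 1G foo.box foo.box.part.
    md5sum foo.box.part.* > foo.box.md5

    # On the destination, after transferring the chunks and foo.box.md5:
    md5sum -c foo.box.md5            # verify each chunk
    cat foo.box.part.* > foo.box     # reassemble
    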

    You could also set up a personal BitTorrent network. BitTorrent hashes every piece as it arrives, so that would ensure that the whole thing reaches safely.


  • Related Question

    unix - Verifying successful file transfer after moving lots of large files
  • amdfan

    I'm working on transferring a lot of files (>100GB, several thousands of files) over my network to a new Mac. Once the transfers are done I'd like to be able to verify that all of the files were successfully transferred and that no corruption occurred during the process.

    The files are coming from a FreeNAS (open source NAS based on FreeBSD) server share. They're being loaded onto the Mac file system.

    So far the best solution I can think of is running ls -aR on the share and on the local disk, saving each listing to a file, and then diffing the two files. Are there better solutions? Even better, is there a way to do this that hashes the files to make sure all the data was successfully transferred?

    In terms of my computer skills, I'm comfortable using terminal applications so there's no need to recommend only GUI tools.


  • Related Answers
  • SleighBoy

    One word. rsync.

  • larsks

    As SleighBoy mentioned, Rsync is the standard tool for this sort of thing.

    Your suggestion for diffing the output of ls on both systems wouldn't work, because while it would verify that the files were present in both locations it would do nothing to verify the integrity of the data. If you were to do this manually, you would need to generate a checksum for each file on both systems and then verify that the checksums match. There are tools out there to do this for you (radmind has tools that make this easy), but it's easier to just use Rsync.
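    For example, after the copy has finished, a dry run with checksums will list any files whose contents differ (paths and host are placeholders; checksumming every file will take a while):

    rsync -rvn --checksum /path/to/source/ user@mac:/path/to/dest/
    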

    This article describes a clever use of BitTorrent to achieve roughly the same thing. I don't recommend this solution, but it's an interesting read.

  • dlamblin

    If you have the files on both sides, you can run md5sum on each end and compare the hash. The way to do this on a bunch of files is to tar them to stdout and pipe the output to md5sum.
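    For example, run the same pipeline on both machines and compare the two hashes (this only works if tar produces byte-identical output on both sides, i.e., the same tar implementation and the same file ordering):

    tar -cf - some_directory | md5sum
    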

    rsync and plain scp -r or rcp -r are also your friends.