linux - Files with same content but with different md5sums when gzip'd?

08
2014-07
  • Valter Silva

    I upload files to Amazon S3 and then delete the local copies after checking that their md5sum matches both on Amazon and locally. Recently, though, I found that two files with the same content produce different md5sums once they are gzip'd:

    [valter.silva@alog ~]$ ls
    renew.log  s3
    
    [valter.silva@alog ~]$ ls s3/
    renew.log
    
    [valter.silva@alog ~]$ md5sum renew.log 
    d41d8cd98f00b204e9800998ecf8427e  renew.log
    
    [valter.silva@alog ~]$ md5sum s3/renew.log 
    d41d8cd98f00b204e9800998ecf8427e  s3/renew.log
    
    [valter.silva@alog ~]$ gzip renew.log 
    [valter.silva@alog ~]$ gzip s3/renew.log 
    
    [valter.silva@alog ~]$ md5sum renew.log.gz 
    aa1f0ae9a61aac5bcd32b917fbd9324b  renew.log.gz
    
    [valter.silva@alog ~]$ md5sum s3/renew.log.gz 
    6ae0e48edb68e9ed938fdfc3894f6c94  s3/renew.log.gz
    

    Does anybody know why this is happening? And how should I check that my files are consistent and reliable?

    Update, in response to Tiago Cruz's answer:

    [valter.silva@alog ~]$ sha1sum renew.log 
    da39a3ee5e6b4b0d3255bfef95601890afd80709  renew.log
    
    [valter.silva@alog ~]$ sha1sum s3/renew.log 
    da39a3ee5e6b4b0d3255bfef95601890afd80709  s3/renew.log
    
    [valter.silva@alog ~]$ gzip renew.log 
    [valter.silva@alog ~]$ gzip s3/renew.log 
    
    [valter.silva@alog ~]$ sha1sum renew.log.gz 
    2d9111d9db71da9fe4de57fbc19c89eb0bd46470  renew.log.gz
    
    [valter.silva@alog ~]$ sha1sum s3/renew.log.gz 
    05014ca24d133f1761f9134e8dab52e6e2111010  s3/renew.log.gz
    

    It gives the same problem, Tiago.

  • Answers
  • mpy

    According to RFC 1952, the gzip file header includes the modification time of the original file (field MTIME). You can display the header in plain text (see note 1) with gzip -lv renew.log.gz:

    method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
    defla 64263ac7 Jun 21 17:59                 314                 597  52.1% renew.log
    

    So, if you really want to compare the gzip'd files, compress them with the -n option, which stops gzip from saving the original file name and time stamp,

    gzip -n renew.log s3/renew.log 
    

    and their md5sum should be identical.
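
    To see the effect in isolation, here is a minimal sketch. The temporary directory layout, the sample text and the two timestamps are made up for illustration; only gzip, touch and md5sum are real. It creates two copies of the same content with different modification times and compresses one pair with the default options and the other with -n.

```shell
#!/bin/sh
# Sketch only: directory names, sample text and dates are invented.
set -eu

work=$(mktemp -d)
mkdir -p "$work/default/s3" "$work/noname/s3"

for variant in default noname; do
    printf 'certificate renewed\n' > "$work/$variant/renew.log"
    cp "$work/$variant/renew.log" "$work/$variant/s3/renew.log"
    # Give the two copies clearly different modification times.
    TZ=UTC touch -t 201401010000 "$work/$variant/renew.log"
    TZ=UTC touch -t 201402020000 "$work/$variant/s3/renew.log"
done

gzip    "$work/default/renew.log" "$work/default/s3/renew.log"
gzip -n "$work/noname/renew.log"  "$work/noname/s3/renew.log"

# Print only the digest of a file, without the file name.
digest() { md5sum < "$1" | cut -d ' ' -f 1; }

default_local=$(digest "$work/default/renew.log.gz")
default_s3=$(digest "$work/default/s3/renew.log.gz")
noname_local=$(digest "$work/noname/renew.log.gz")
noname_s3=$(digest "$work/noname/s3/renew.log.gz")

echo "default: $default_local $default_s3"   # the two digests should differ
echo "with -n: $noname_local $noname_s3"     # the two digests should match

rm -rf "$work"
```

    This assumes both files are compressed by the same gzip version at the same compression level; otherwise the compressed stream itself may differ.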

    Otherwise you could use

    md5sum <(zcat renew.log.gz) <(zcat s3/renew.log.gz)
    

    to calculate the md5sum of the decompressed files.
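
    If you want this check inside a script (for example before deleting a local copy), the same idea can be wrapped in a small function. This is only a sketch: same_payload is a hypothetical helper name, and it first runs gzip -t so that a corrupt or missing archive is not silently hashed.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when two .gz files decompress to the same bytes.
# Returns 2 if either archive fails gzip's integrity test.
same_payload() {
    gzip -t -- "$1" "$2" || return 2
    first=$(gzip -dc -- "$1" | md5sum | cut -d ' ' -f 1)
    second=$(gzip -dc -- "$2" | md5sum | cut -d ' ' -f 1)
    [ "$first" = "$second" ]
}
```

    Usage would look like: same_payload renew.log.gz s3/renew.log.gz && echo "contents match"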


    1) However, the displayed name, time and date are not taken from the header; they reflect the current name and time stamp of the .gz file:

    $ gzip renew.log 
    $ mv renew.log.gz foo.gz
    $ gzip -lv foo.gz -------- uncompressed name is taken from current name ---v
    method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
    defla 6c721644 Jul 11 22:34                 580                1586  65.7% foo
    $ hexdump -C foo.gz | head -n 2
    00000000  1f 8b 08 08 f0 16 df 51  00 03 72 65 6e 65 77 2e  |.......Q..renew.|
    00000010  6c 6f 67 00 8d 93 dd 6e  9b 30 18 86 8f 89 94 7b  |log....n.0.....{|
                                                                 ^^^-------^^^^^
                                                      original filename is stored in the header
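
    The stored values can also be read straight from the header. Per RFC 1952, byte 3 is the flag byte (bit 0x08 means a file name follows) and bytes 4 to 7 hold MTIME as a little-endian number of seconds since 1970. The following is a rough sketch; gz_mtime and gz_has_name are hypothetical helper names, not part of gzip.

```shell
#!/bin/sh
# Sketch: decode two fields of a gzip header with od.

# Print the MTIME field (bytes 4..7, little-endian) as a decimal epoch value.
gz_mtime() {
    # od prints the four bytes as decimal numbers; word splitting is intended.
    set -- $(od -An -tu1 -j4 -N4 < "$1")
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Succeed if the FNAME flag (0x08 in byte 3) is set, i.e. a name is stored.
gz_has_name() {
    flags=$(od -An -tu1 -j3 -N1 < "$1" | tr -d ' ')
    [ $(( flags & 8 )) -ne 0 ]
}
```

    Applied to the hexdump above, the bytes f0 16 df 51 decode to 1373574896, which is 2013-07-11 20:34:56 UTC. A file compressed with -n should report an MTIME of 0 and no stored name.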
    
  • Tomas

    Why do you expect compressed versions of the same file to be identical? The compression program (gzip) can include a timestamp in the header, or could use a randomized algorithm.

    And that is exactly it: the gzip header contains the timestamp. If you want the compressed files to be identical, the input files have to have the same timestamp.

    So, when you copy a file, always do it as cp -p file1 file2, not just cp file1 file2, which is actually a bad habit!
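
    A quick way to convince yourself, sketched with made-up directory names (kept, lost) and an invented timestamp: the copy made with cp -p keeps the modification time and compresses to the same bytes as the original, while the plain copy gets a fresh time stamp and does not. Note that the copies keep the same base name, since the name is stored in the header too.

```shell
#!/bin/sh
# Sketch only: directory names, sample text and the date are invented.
set -eu

work=$(mktemp -d)
mkdir "$work/kept" "$work/lost"

printf 'certificate renewed\n' > "$work/renew.log"
TZ=UTC touch -t 201401010000 "$work/renew.log"   # pretend the file is old

cp -p "$work/renew.log" "$work/kept/renew.log"   # keeps the modification time
cp    "$work/renew.log" "$work/lost/renew.log"   # gets the current time instead

gzip "$work/renew.log" "$work/kept/renew.log" "$work/lost/renew.log"

# Print only the digest of a file, without the file name.
digest() { md5sum < "$1" | cut -d ' ' -f 1; }

original=$(digest "$work/renew.log.gz")
kept=$(digest "$work/kept/renew.log.gz")
lost=$(digest "$work/lost/renew.log.gz")

echo "original: $original"
echo "cp -p:    $kept"    # should equal the original
echo "cp:       $lost"    # should differ from the original

rm -rf "$work"
```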

  • Tiago Cruz

    Just use gzip with the '-n' flag:

    tiagocruz@stark:~$ gzip -n Yippie-Ki-Yay.mp3 bla/Yippie-Ki-Yay.mp3 
    
    tiagocruz@stark:~$ sha1sum Yippie-Ki-Yay.mp3.gz bla/Yippie-Ki-Yay.mp3.gz 
    b44b21c5f414935f1ced1187bfafd989704474a5  Yippie-Ki-Yay.mp3.gz
    b44b21c5f414935f1ced1187bfafd989704474a5  bla/Yippie-Ki-Yay.mp3.gz
    

    Source: http://unix.stackexchange.com/questions/31008/why-does-the-gzip-version-of-files-produce-a-different-md5-checksum


  • Related Question

    linux - Why are there binary differences among compressed files generated exactly the same way from the exact same starting file?
  • Christopher Bottoms

    I use the "diff" command to compare two compressed files generated using zip on the exact same starting file and they are reported as being different. However, when I uncompress them and use the "diff" command, no differences are shown. I've noticed this with both zip and gzip.


  • Related Answers
  • cyborg

    You might also like to use zdiff if you want to compare the contents of the compressed files.

  • Kevin Panko

    One of the fields in the gzip header differs between the two files. One such field is the last-modified time of the file that was compressed (in seconds since 1970) or, if the compressed data was not read from a file, the time at which compression took place.

    Even a one second difference is enough to make the gzip files not match.
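
    The difference is confined to the header, though. RFC 1952 also defines an 8-byte trailer holding the CRC-32 and the size of the uncompressed data, and that part still agrees when the content is the same. As a rough sketch (gz_trailer is a hypothetical helper name, and this assumes an ordinary single-member .gz file), you can print the trailer and compare it:

```shell
#!/bin/sh
# Sketch: print the last 8 bytes of a .gz file as 16 hex digits.
# Per RFC 1952 these are CRC-32 and ISIZE of the uncompressed data.
gz_trailer() {
    tail -c 8 < "$1" | od -An -tx1 | tr -d ' \n'
    echo
}
```

    Keep in mind that CRC-32 is a weak checksum, so matching trailers are a quick hint, not a replacement for hashing the decompressed data.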

  • JMD

    Two possible causes:

    • different compression algorithm used by the same compression program, or
    • different compression programs
  • jrw32982

    You can use the --no-name gzip option to stop gzip from adding the original file name and the time stamp to the gzip header. That should prevent mismatches when the data is the same, assuming the same compression level is used. One way to add this option to gzip commands is to set the GZIP environment variable, so that the option is picked up by every gzip command. For example, in a Bourne-compatible shell such as bash,

    export GZIP="--no-name -6"
    

    or

    export GZIP=--no-name
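
    If you would rather not rely on an environment variable (as far as I know, newer gzip releases treat GZIP as obsolescent and may print a warning), a small shell function gives a similar result. This is just a sketch, and gzip_plain is a made-up name:

```shell
#!/bin/sh
# Hypothetical wrapper: always pass --no-name, then whatever the caller supplied.
gzip_plain() {
    gzip --no-name "$@"
}
```

    You would then call gzip_plain renew.log s3/renew.log instead of gzip.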