What's the probability of two files with the same byte size producing the same MD5 hash?

2014-07-08
  • CuSS

    I'm developing an app that will store a lot of files. For images, it will resize them and save the different thumbnails we need, so when a user uploads an image it will save 8 more files (this is needed).

    To avoid duplicated files and to save space on my static hosting server, my app is saving the file name as "MD5.BYTE_SIZE" (ex: 054d995efa7e9c91569d205d24a2b486.188154)
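    A minimal sketch of building such a name on the command line, assuming GNU coreutils (`md5sum`, `stat -c`; on BSD/macOS the equivalent would be `md5` and `stat -f %z`). The file here is just a stand-in for an upload:

    ```shell
    # Build a content-addressed name in the form MD5.BYTE_SIZE.
    # The temp file stands in for an uploaded file.
    file=$(mktemp)
    printf 'hello' > "$file"                 # 5 bytes of stand-in content
    hash=$(md5sum "$file" | cut -d' ' -f1)   # 32 hex digits of MD5
    size=$(stat -c %s "$file")               # size in bytes (GNU stat)
    name="${hash}.${size}"
    echo "$name"
    rm -f "$file"
    ```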

    I've already used this file-naming scheme for other clients without any problems, but I need to know, specifically for this project, whether it's possible for a user to send a different file with the same MD5 and the same byte size.

    If so, what's the best way to save my file names? With two different hashes (like MD5.SHA-256.BYTE_SIZE)?

  • Answers
  • Jan Schejbal

    For practical purposes, zero, unless the user actively tries to create two files that have the same hash, which is possible with MD5.

    If you use SHA-256 instead, it is "zero" (for practical purposes) even if the user actively tries to create two files with the same hash and size.

    The exact probability of two different files producing the same hash is somewhere around 1/2^128. Due to the birthday paradox, you would need around 2^64 files before there is a 50% chance that two of them share a hash. Do not worry about it in practice. For SHA-256, the numbers are 1/2^256 and 2^128, respectively. These numbers are also known as "not going to happen".
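    A quick way to see where "not going to happen" comes from: by the birthday approximation, the chance of any collision among n random 128-bit hashes is roughly n²/2¹²⁹. A sketch using `bc` (the file count n = 10⁹ is an arbitrary example, not from the question):

    ```shell
    # Birthday approximation: p ≈ n^2 / 2^129 for n files hashed
    # with a 128-bit hash. n is an arbitrary example value.
    n=1000000000                               # one billion files
    p=$(echo "scale=30; $n^2 / 2^129" | bc -l)
    echo "$p"                                  # on the order of 10^-21
    ```

    Even at a billion stored files, the collision probability is around 10⁻²¹, which is why the answer treats it as zero in practice.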

  • Austin ''Danger'' Powers

    It is theoretically possible, but in reality the chance of two different files having the same MD5 checksum is vanishingly small.

    In other words, so small that you can essentially treat the event as impossible as far as your program is concerned.


  • Related Question

    hashing - Can you use OpenSSL to generate an md5 or sha hash on a directory of files?
  • Kieveli

    I'm interested in storing an indicator of file/directory integrity between two archived copies of directories. It's around 1 TB of data stored recursively on hard drives. Is there a way, using OpenSSL, to generate a single hash for all the files that can be used to compare two copies of the data, or at a later point to verify the data has not changed?


  • Related Answers
  • AaronLS

    You could recursively generate all the hashes, concatenate the hashes into a single file, then generate a hash of that file.
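    That idea can be sketched as a small shell function (assuming `openssl` and a POSIX `find`; the function name is made up here). Sorting the per-file hash lines makes the result independent of traversal order, and running `find` from inside the directory keeps the paths relative, so two copies of the same tree compare equal wherever they live:

    ```shell
    # Fingerprint a directory tree: hash each file, sort the per-file
    # hash lines into a stable order, then hash that sorted list into
    # a single digest. "tree_md5" is a hypothetical helper name.
    tree_md5() {
        (cd "$1" && find . -type f -exec openssl md5 {} +) \
            | LC_ALL=C sort | openssl md5
    }
    ```

    `tree_md5 folder1` and `tree_md5 folder2` print the same digest exactly when the two trees contain the same relative paths with the same contents.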

  • John T

    You can't do a cumulative hash of them all to make a single hash, but you can compress them first then compute the hash:

    $ tar -czpf archive1.tar.gz folder1/
    $ tar -czpf archive2.tar.gz folder2/
    $ openssl md5 archive1.tar.gz archive2.tar.gz
    


    To hash each file recursively instead:

    $ find . -type f -exec openssl md5 {} +
    
  • Rudedog

    Taking an MD5 sum of the tar would never match unless all of the metadata (modification times, ownership, etc.) was identical as well, because tar stores that as part of its archive.
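    If you do want the tar route to produce comparable checksums, GNU tar can be told to normalize the metadata it records (a sketch; the options below are GNU tar specific, `--sort=name` needs tar 1.28+, and "folder1" plus the mtime value are placeholder choices):

    ```shell
    # Normalize the metadata tar stores so archives of identical
    # content come out byte-identical and hash the same: fixed entry
    # order, fixed mtime, fixed ownership. "folder1" is a placeholder.
    mkdir -p folder1 && printf 'demo' > folder1/file   # stand-in data
    tar --sort=name --mtime='UTC 2020-01-01' \
        --owner=0 --group=0 --numeric-owner \
        -cf archive1.tar folder1/
    openssl md5 archive1.tar
    ```

    Skip the `-z` flag here (or pipe through `gzip -n`), since gzip embeds its own timestamp in the compressed output and would reintroduce the problem.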

    I would probably do an md5 sum of the contents of all of the files:

    find folder1 -type f | sort | tr '\n' '\0' | xargs -0 cat | openssl md5
    find folder2 -type f | sort | tr '\n' '\0' | xargs -0 cat | openssl md5