linux - Delete duplicate files with rsync

04
2013-09
  • codedme

    Here is the situation:

    I have a 50 GB folder on my server containing over 60,000 files. I used rsync to transfer it to a mirror server, and almost half of the files have been transferred. Now I want to delete the transferred files on the main server.

    Can this be done with rsync? I read the help and found the --delete option, but these files are very important, so I would like an expert opinion. Thanks.

  • Answers
  • Florian Feldhaus

    rsync (checked with version 3.0.9) has an option called --remove-source-files which does what it says. If you only want to delete the files that have already been transferred, without transferring any files that are not yet on the target, you also need the option --existing.

    Unfortunately it seems that rsync doesn't output which files it is deleting even if options --verbose --itemize-changes --stats are used.

    Example

    # create source and target dirs
    mkdir /tmp/source
    mkdir /tmp/target
    # create a test file in source
    touch /tmp/source/test
    # rsync source and target
    rsync --archive --itemize-changes --verbose --stats /tmp/source/ /tmp/target
    # verify that test has been copied to target
    [ -f /tmp/target/test ] && echo "Found" || echo "Not found"
    # create another file in source
    touch /tmp/source/test2
    # delete files on source which are already existing on target
    rsync --archive --itemize-changes --verbose --stats --remove-source-files --existing /tmp/source/ /tmp/target
    # verify that test has been deleted on source
    [ -f /tmp/source/test ] && echo "Found" || echo "Not found"
    # verify that test2 still exists on source and was not transferred to target
    [ -f /tmp/source/test2 ] && echo "Found" || echo "Not found"
    [ -f /tmp/target/test2 ] && echo "Found" || echo "Not found"
    
  • JvO

    As written before, rsync's --delete option will not delete from the source, only on the destination.

    In your case, I would generate MD5 hashes of the files on the mirror server, then check on the primary server if the hashes are correct and remove those files.

    I.e.:

    mirror$ find . -type f -print0 | xargs -0 md5sum > mirror.md5

    ..transfer mirror.md5 to primary server...

    primary$ md5sum -c mirror.md5

    Check for any FAILED files, then remove the files that have been transferred successfully. You could automate it like this:

    md5sum -c mirror.md5 | grep ': OK$' | sed -e 's/: OK$//' | while IFS= read -r FILE; do rm -- "$FILE"; done

    This will filter all files with a good hash, chop off the 'OK' part from md5sum and remove the files one by one.

    Needless to say, after this you don't want to use the --delete option from rsync to transfer the second half of your files...
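
    The hash-then-delete procedure above can be tried safely in a throwaway directory first. The following is a minimal sketch, not a production script: the directories primary and mirror and the files a.txt, b.txt and c.txt are hypothetical stand-ins for the two servers and their contents, and the checksum list is written outside the hashed tree so that it does not end up hashing itself.

```shell
#!/bin/sh
# Sketch only: "primary" and "mirror" are hypothetical local stand-ins
# for the two servers, created under a temporary directory.
set -eu
work=$(mktemp -d)
mkdir -p "$work/primary" "$work/mirror"

# Three files on the primary. Only a.txt arrived intact on the mirror,
# b.txt arrived damaged, and c.txt was never transferred.
printf 'alpha\n' > "$work/primary/a.txt"
printf 'beta\n'  > "$work/primary/b.txt"
printf 'gamma\n' > "$work/primary/c.txt"
cp "$work/primary/a.txt" "$work/mirror/a.txt"
printf 'damaged\n' > "$work/mirror/b.txt"

# On the mirror: hash every file that arrived.
( cd "$work/mirror" && find . -type f -print0 | xargs -0 md5sum > "$work/mirror.md5" )

# On the primary: verify the hashes and delete only the files reported OK.
cd "$work/primary"
md5sum -c "$work/mirror.md5" 2>/dev/null | grep ': OK$' | sed -e 's/: OK$//' |
while IFS= read -r file; do
    rm -- "$file"
done

# Show what is left on the primary.
ls "$work/primary"
```

    Only files whose hash matches the mirror's copy are removed. A damaged transfer or a file that was never copied stays on the primary, so it can be transferred again later.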


  • Related Question

    linux - how to exclude rsync excludes from delete
  • Rob

    Hey, I'm using rsync to sync files across multiple machines, with the following command:

    rsync -az -e "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
    --delete --delete-excluded --force --exclude=.git  --exclude=.bundle \
    --exclude=tmp --exclude=log/* --exclude=*.log --exclude=*.pid \
    user@host:/path/to/src/ /var/build/dest
    

    I want to exclude all log files from being transferred from the src to the dest, and to delete any existing ones on the destination, so I'm using --exclude=*.log with --delete-excluded, which works great...

    But I want to keep one particular log file intact on the destination. In other words, I want something like an --exclude-from-delete option.

    Is this possible with rsync?
    TIA

    --Rob


  • Related Answers
  • koniiiik

    The "protect" filter should accomplish just what you want:

              protect, P specifies a pattern for protecting files from deletion.
    

    Just specify the following filter before the relevant excludes:

    --filter='P my-specific-logfile.log'
    

    (Notice the space after the letter P.)

  • mrucci

    Yes, just use the option:

    --include=PATTERN
    

    before the exclude options.