centos - File synchronization between two Linux servers

2014-07-08
  • Kit Ho

    We have two web servers running CentOS. We need to sync the images that users upload.

    No central server should be needed to synchronize them, because we need to handle failover. Also, the sync has to be two-way.

    We tried rsync and inotify, but both require a server to be set up, so we can't do failover.

    How else can we do that?

  • Answers
  • mtak

    You can just run rsync on both servers:

    server1$ rsync -a -v -e "ssh -c arcfour" user@server2:/path/to/files /path/to/files
    server2$ rsync -a -v -e "ssh -c arcfour" user@server1:/path/to/files /path/to/files
    

    Rsync will only copy files that are missing or out of date on the destination system.
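
    If the files have to converge continuously, the same two commands can be scheduled on each machine. A minimal sketch, assuming a five-minute cron interval (the interval is a placeholder; the paths are the ones from the commands above):

    # crontab entry on server1; the mirror-image entry goes on server2
    */5 * * * * rsync -a -e "ssh -c arcfour" user@server2:/path/to/files /path/to/files

    Because neither direction uses --delete, a file uploaded to either server eventually shows up on both.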


  • Related Question

    nas - Most effective backup software for linux -> linux when dealing with large numbers of files
  • Fake Name

    I have two NASes.
    I work off of one, and the other is used as a backup. As I have it set up now, it's slow. Running a backup takes a week.
    Even for 7 TB, with 1,979,407 files, this seems a bit outlandish, particularly as both systems are RAID-5 and the network is all gigabit.

    I've been digging about in the rsync man pages, and I really don't understand what differentiates the various topologies.
    Right now, all the processing is being done on the backup NAS, which has the main volume from the main NAS mounted locally over SMB. I suspect that the SMB overhead is killing me, particularly when dealing with lots of files.

    I think what I need is to set up rsync on the main NAS as a daemon, and then run a local rsync client to connect to it, which would hopefully let me avoid the whole SMB-in-the-middle affair. But aside from mentioning that daemon mode exists, I can find very little information on why one would want to use it.
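
    As far as I can tell, the daemon setup would look roughly like this (an untested sketch; the module name and the host alias are my guesses, the path is the real one on the NAS):

    # /etc/rsyncd.conf on the main NAS
    [storage]
        path = /raid/data/Storage
        read only = yes

    # on the main NAS: start the daemon
    rsync --daemon

    # on the backup NAS: pull over the rsync protocol, no SMB in the middle
    rsync -r --progress --delete main-nas::storage/ /mnt/Storage/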

    Here's my current rsync command line:
    rsync -r --progress --delete /cifs/Thecus/ /mnt/Storage/

    Any input? Is there a better way/tool to do this?

    Edit:
    Ok, to address the additional questions:
    The "Main" NAS is a Thecus N7700. I have additional modules installed that give me SSH, and it has rsync, but it's not in the $PATH, and I havn't figured out how to edit the local $PATH in a way that persists between reboots.
    The "Backup" NAS is a DIY affair, built around a 1.6Ghz Via Mobo with a Adaptec Hardware RAID card. It's running CentOS 5 with a full desktop environment. It's the hardware I'm running rsync from. (Gigabit is through a additional PCI card).

    Further Edit: Ok, got rsync over SSH working (thanks, lajuette!).
    I had to do a bit of tweaking on my command line; I'm running rsync with the args:
    rsync -rum --inplace --progress --delete --rsync-path=/opt/bin/rsync [email protected]:/raid/data/Storage /mnt/Storage
    (Note: I'm specifically not using -a, because I want to change the ownership to the local account, so as not to freak out SELinux.)

    It seems to be working. I'll see how long it takes.


  • Related Answers
  • lajuette

    You are right: SMB is horribly slow when it comes to lots of files.

    I use rsync myself for syncing my music library.

    rsync -aum --delete /my/music/library/* 192.168.1.5:/backup/of/music/library/
    

    That way I tell rsync to sync via SSH. You need an SSH server running on the target machine (192.168.1.5 in my case) and rsync installed on both machines.

    Here's an explanation of the options:

    • -a: archive all files (implies the options rlptgoD)
    • -u: update existing files, don't copy them again if they are already in place
    • -m: prune empty dirs
    • --delete: delete files on target which were deleted on source

    Options implied by the -a flag:

    • -r: recurse through subdirs
    • -l: preserve symlinks as symlinks
    • -p: preserve permissions
    • -t: preserve modification time
    • -g: preserve group
    • -o: preserve owner
    • -D: preserve device and special files

    This should sync your NAS quite fast. If you try it, please post your results!
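
    Since --delete is involved, it's worth checking first what a run would actually do. Adding -n (dry run) and -v to the same command prints the planned transfers and deletions without touching anything:

    rsync -aumnv --delete /my/music/library/* 192.168.1.5:/backup/of/music/library/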

  • Ignacio Vazquez-Abrams

    rsync running as a daemon is unsecured, so it's really only useful for stores that you want to make publicly accessible. The way to do it is to get SSH working on the NAS so that you can rsync to nas-device:/path/to/storage directly; from there you can tweak the SSH settings to optimize the transfer.
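
    For example, the cipher choice can be pinned per host in ~/.ssh/config instead of on every command line. A sketch, where the host alias and the address are placeholders:

    # ~/.ssh/config on the machine that runs rsync
    Host nas-device
        HostName 192.168.0.10   # placeholder address
        Ciphers arcfour         # weak but fast; only acceptable on a trusted LAN
        Compression no          # ssh-level compression rarely helps on gigabit

    rsync then picks these settings up automatically whenever it connects to nas-device.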

  • MattBianco

    What kind of NASes are these? Are you running rsync on the embedded CPU? Perhaps the CPU is the bottleneck here.
    Do you know what the internal filesystem on the NASes is? Are there millions of files in the same directory?

    If you have a gigabit network to both NASes, and have them both mounted on your Linux box (with smbmount or NFS), it shouldn't be that slow to sync them with rsync, which I believe is the best option for syncing large amounts of data like yours. Just try to figure out where the bottleneck is first. Then it will be much easier to find a better solution.
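
    For example, a crude way to separate the network from the disks (the host name and the test file are placeholders):

    # raw network + ssh throughput, no disks involved
    dd if=/dev/zero bs=1M count=1024 | ssh main-nas 'cat > /dev/null'

    # read speed through the SMB mount (the suspected bottleneck)
    dd if=/cifs/Thecus/some-large-file of=/dev/null bs=1M

    dd prints the achieved throughput when it finishes, so the two numbers can be compared directly.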

  • Tullis

    Do you know about using hard-links to create space efficient, point-in-time backups?

    Here's an article about it. http://www.mikerubel.org/computers/rsync_snapshots/

    As you're using rsync you're halfway there already, but it could be a useful addition to your existing system.

    Essentially, you can store many, many copies of your source data. Each of them looks like a full directory structure, but files which don't change between versions share the same inodes on the disk(s). Although the simplest solution is to use rsync with the --link-dest parameter, as outlined in the article (a minimal sketch follows the list below), the technique is also implemented in other backup software, such as:

    • backuppc :: backuppc.sourceforge.net
    • back-in-time :: backintime.le-web.org
    • rsnapshot :: rsnapshot.org (Haven't personally used this one)
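
    A minimal sketch of the --link-dest technique, assuming snapshots are kept under a hypothetical /backup directory:

    # each run creates a dated snapshot; unchanged files become hard links
    # into the previous snapshot instead of fresh copies
    today=$(date +%Y-%m-%d)
    rsync -a --delete --link-dest=/backup/latest /mnt/Storage/ "/backup/$today/"
    ln -snf "/backup/$today" /backup/latest   # repoint "latest" at the newest snapshot
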
  • Thomas

    In case you still have problems, or for others reading this, I recommend looking into the following rsync options (in addition to the ones mentioned by lajuette, like the immensely useful -u); they are pulled together into a sketch after the list:
    • -z (compress): unless your network is much faster than your CPU, this may save time, but you can test that
    • --partial-dir='.rsync-partial': in case the connection craps out while you were transferring a 7 gigabyte movie file, you can continue where you left off rather than restart (I consider --inplace, which is incompatible with this, too dangerous)
    • -v (verbose mode): only for testing/troubleshooting
    • --exclude-from='your-exclude-list-file': if you have backup files, system files, thumbnail images, temporary/cache files, certain dirs, etc. that you don't need to back up, list them in the exclude file, with optional wildcards; this may reduce the volume
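
    Put together with the command from the question, that would look something like this (the exclude file and the user@host part are placeholders):

    # note: --inplace from the question is dropped; it is incompatible with --partial-dir
    rsync -rumzv --progress --delete \
        --partial-dir='.rsync-partial' \
        --exclude-from="$HOME/rsync-excludes.txt" \
        --rsync-path=/opt/bin/rsync \
        user@main-nas:/raid/data/Storage /mnt/Storage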

    The --delete option is very dangerous and should be used with great caution: if you accidentally delete one or more files, and the backup runs (e.g. via a cron job) before you realize it, then your backed-up copies are gone, too.

    The PATH variable should be set/modified in your $HOME/.profile file; that file is read whenever you log in, so the change persists across reboots.
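
    For instance, since the question found rsync under /opt/bin, one line in $HOME/.profile on the main NAS would do (a sketch):

    # make /opt/bin (where the Thecus modules put rsync) available on every login
    export PATH="$PATH:/opt/bin"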

    Apart from that, I second MattBianco's suggestion of trying to find the bottleneck first.

    Hope this helps.