I have been using an rsync script to synchronize data on one host with the data on another host. The data consists of numerous small files that add up to almost 1.2 TB.
To sync those files, I have been using the following rsync command:
rsync -avzm --stats --human-readable --include-from proj.lst /data/projects REMOTEHOST:/data/
The contents of proj.lst are as follows:
+ proj1
+ proj1/*
+ proj1/*/*
+ proj1/*/*/*.tar
+ proj1/*/*/*.pdf
+ proj2
+ proj2/*
+ proj2/*/*
+ proj2/*/*/*.tar
+ proj2/*/*/*.pdf
...
...
...
- *
As a test, I picked two of those projects (8.5 GB of data) and executed the command above. Being a sequential process, it took 14 minutes 58 seconds to complete. So, for 1.2 TB of data, it would take several hours.
If I could run multiple rsync processes in parallel (using &, xargs, or parallel), it would save me time.
I tried the following command with parallel (after cd'ing to the source directory), and it took 12 minutes 37 seconds to execute:
parallel --will-cite -j 5 rsync -avzm --stats --human-readable {} REMOTEHOST:/data/ ::: .
This should have taken about a fifth of the time, but it didn't. I think I'm going wrong somewhere.
How can I run multiple rsync processes in order to reduce the execution time?
Running rsyncs in parallel is the primary focus now.