Four Ways to Run Parallel Rsyncs to Speed Up Synchronization

Eleanor Parker

Rsync was designed in the 90s for far smaller files, and far fewer of them, than today’s businesses move on a regular basis. It was created to synchronize files between two servers, which it still does well to this day.

But what if you need to sync files across more than two servers? To start, you can script and run multiple instances of Rsync. The scripts you create and support are broadly known as “parallel rsync.”

Below, we’ll look at four ways to launch parallel Rsyncs, including:

  1. GNU Parallel
  2. Multi-stream Rsync (msrsync)
  3. Parsyncfp
  4. Launching multiple rsync sessions

In addition, tools like Lsyncd can help with this task. Unfortunately, one problem you’ll encounter with this approach is that your data sets will never be fully in sync.

You’ll also be in charge of building, scripting, troubleshooting, and supporting Rsync, Lsyncd, and any other tool you use to brute-force your way to parallel Rsyncs (not something you want to scale for enterprise business needs). Syncing multiple servers concurrently as fast as possible to keep them all in sync is enough to keep any admin busy, not considering the inevitable Rsync troubleshooting when things break.

If you do get everything working, there are still other hurdles to overcome. Rsync is a single-threaded command-line sync solution; it can’t capture incremental file changes in real time, especially if you have a larger file system with millions of files. It also falls short if you need multi-directional synchronization over WANs.

As a result, syncing large files or large numbers of files with Rsync across systems can take too long to make it a feasible solution (one user found it took almost 15 minutes to transfer just 8GB of data). The problem compounds when transferring to multiple endpoints, because each additional endpoint adds another transfer to wait on. There’s just no way to speed up Rsync enough to solve these problems.

If you’re just syncing files and folders between two offices, consolidating files into one location, or pushing files out (one way only) to a handful of other remote servers, you can use Rsync to keep them reasonably well synced as long as you have a good network. But, for any more complex use cases (or when you need native security features — which Rsync lacks), you’ll need an alternative solution.

Resilio Platform is a powerful alternative to Rsync. To see for yourself how quickly Resilio syncs files of any size/number across multiple endpoints, schedule a demo.

Resilio Platform is designed to handle large replication environments and use cases. It uses a P2P (peer-to-peer) replication architecture which lets every device in your replication environment share data with and receive data from other devices in the network. This enables blazing-fast replication speeds (we’ve seen 100+ Gbps), syncing in any direction, and organic scalability to sync files of any size and number to any number of endpoints. Plus, Resilio’s WAN optimization protocol enables it to quickly and reliably sync files over any network — including high-latency, loss-prone WANs and edge networks.

In contrast to Rsync, Resilio Platform is:

1. Reliable — There’s no single point of failure and no data corruption. Files always sync reliably and on time.

2. Fast — Sync files of any size and type across any number of servers in any direction throughout the world thanks to P2P replication and WAN optimization.

3. Fluidly scalable — Simply add servers to a job to sync files across them.

4. Easy to use — You can automate everything via scripts, APIs, or in the UI. Once it’s set up, it “just works” in the words of Deutsche Aircraft’s IT manager.

5. Proven — Well-respected businesses such as Blizzard, Larian Studios, Mercedes-Benz, Cisco, Skywalker Sound, Turner Sports, Lindblad Expeditions, Exxon, and more rely on Resilio. Read more on our case studies page.

6. Well-supported — You don’t need to fuss with managing scripts and supporting them thanks to Resilio’s world-class engineering and support.  

7. Predictable — You know when the job will complete. This consistency is huge when you’re synchronizing hundreds of servers for QA, development, and production. 

8. Transparent — There are no batch files or actions that users need to take. They just work in files and directories as they normally would. Apps use the files as files; it’s easy to integrate with apps and workflows via an API.

9. Secure — Resilio comes with end-to-end AES-256-bit encryption, cryptographic data integrity validation, and more.

In this article, we’ll cover four different ways you can create parallel Rsyncs and share resources you can use to learn more about them. We’ll also take a deep dive into how Resilio Platform replaces the need for parallel Rsyncs and what it can do for you instead.

Organizations such as Bungie, MixHits Radio, Match.com, Deutsche Aircraft, Mercedes-Benz, and more use Resilio Platform to achieve blazing-fast synchronization speeds that can reach over 100 Gbps. For a full demonstration of how Resilio’s capabilities can optimize your application, schedule a demo.

Four Ways to Run Rsync in Parallel for Faster Synchronization

We scoured blogs and forums to find four of the best tools and Rsync scripting workarounds to create parallel Rsync processes. We’ll describe how each method works and share resources you can use to learn more and employ them.

NOTE: Before actually deploying these (or other) Rsyncs, you should always perform a dry-run to see which files will be affected and ensure your file system copies as intended.
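
As a minimal sketch of the dry-run advice above, the helper below wraps Rsync's -n (--dry-run) and -i (--itemize-changes) options. The function name preview_sync and the paths in the usage note are placeholders chosen for illustration.

```shell
# Preview what rsync would transfer without changing anything.
# preview_sync is a hypothetical helper name, not part of rsync.
preview_sync() {
  src=$1
  dest=$2
  # -a: archive mode (recursive, preserves permissions and times)
  # -n: dry run, report only
  # -i: itemize every change rsync would make
  rsync -a -n -i "$src" "$dest"
}
```

For example, preview_sync /data/src/ /data/dest/ lists the files that would be created or updated in /data/dest/ and leaves both directories untouched.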

1. GNU Parallel

GNU Parallel is a shell tool for Unix and Unix-like operating systems that executes jobs in parallel by splitting the input (which can be lists of files, hosts, users, URLs, and tables) and piping it into commands in parallel.

Put more simply, GNU Parallel acts as a manager, splitting your full list of files into separate smaller Rsync jobs.

You can find and install the latest release of GNU Parallel here.

GNU Parallel commands are split into two parts:

  • The parallel command (which manages the process)

  • The Rsync command (which executes the sync job)

The following is an example of a GNU Parallel command:

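ls A | parallel -j 7 rsync -a A/{} /dest-dir/

The ls command feeds the names of the top-level entries in A to parallel, which substitutes each name for {} and starts one Rsync job per entry. The -j option sets the maximum number of Rsync jobs that run at once (in the case above, seven), and -a tells Rsync to copy directories recursively.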
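
A slightly more defensive version of this idea is sketched below. It assumes GNU Parallel and a find that supports -mindepth, -maxdepth, and -print0; the function name parallel_rsync is a placeholder. Entries are passed NUL-delimited so that names containing spaces survive.

```shell
# Run one rsync per top-level entry of a source directory.
# parallel_rsync is a hypothetical helper name for this sketch.
parallel_rsync() {
  src=$1
  dest=$2
  jobs=${3:-4}   # maximum number of concurrent rsync jobs
  # List top-level entries relative to $src, NUL-delimited.
  ( cd "$src" && find . -mindepth 1 -maxdepth 1 -print0 ) |
    # -0 reads NUL-delimited input; {} is replaced by each entry.
    parallel -0 -j "$jobs" rsync -a "$src"/{} "$dest"/
}
```

Splitting by top-level entry only balances well when those entries are of similar size; one very large directory will still be handled by a single Rsync job.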

2. Multi-Stream Rsync (msrsync)

Multi-stream Rsync (aka msrsync) is a Python wrapper around Rsync that splits transfers into multiple buckets and transfers them with several Rsync processes running in parallel. You can download the wrapper on GitHub.

One of the downsides of msrsync is that it does not handle remote sources or destinations: both directories must be locally accessible (a local disk or a mounted network share such as NFS), which in practice confines it to LAN-style setups. But, if that is all you need, msrsync is a great solution for maximizing bandwidth utilization and copying multiple pieces of data simultaneously.

The following is an example of an msrsync command:

$ msrsync -p 4 /sender /destination

The -p option sets the maximum number of parallel Rsync processes to run between the source directory (sender) and destination directory (destination).

3. Parsyncfp

Parsyncfp is a Perl script that wraps Rsync and runs multiple Rsync processes in parallel. It was designed specifically for transferring very large data sets over fast network connections and enables you to limit the bandwidth utilized, if necessary.

The following is an example of a parsyncfp command:

% parsyncfp --maxload=3.5 --NP=5 --chunksize=2154202 --startdir='[source directory name]' [destination directory name]

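The breakdown of the key parts of this command is as follows:

  • --maxload= sets the maximum system load average; above it, Parsyncfp suspends its Rsync processes (bandwidth is throttled separately with --maxbw=)
  • --NP= sets the number of parallel Rsync instances to run
  • --chunksize= sets the aggregate size of the files assigned to each Rsync instance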

The command ends with source and destination directories.

More information and a link to download Parsyncfp can be found here.

4. Launch Multiple Rsync Sessions

A commonly used solution to create parallel processes in Rsync is to launch multiple Rsync instances that run concurrently with different inputs. 

While this can work, it can be complicated. You have to split your directory structure into smaller, equal parts that transfer separately. And you’ll be forced to keep track of each Rsync instance, which can be cumbersome when deployed in large numbers.

One workaround to this is to use the find command and split command to divide your file list into multiple parts for your Rsync — a great description of how to execute this process using bash scripts can be found here.
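
The find-and-split approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the script from the linked description: it assumes GNU split (for the -n l/N option) and filenames without newlines, and the function name split_and_sync is a placeholder.

```shell
# Split one file list into N chunks and run one rsync per chunk.
# split_and_sync is a hypothetical helper name for this sketch.
split_and_sync() {
  src=$1
  dest=$2
  parts=${3:-4}   # number of concurrent rsync sessions
  workdir=$(mktemp -d)
  # File paths relative to $src, one per line.
  ( cd "$src" && find . -type f ) > "$workdir/all.list"
  # GNU split: divide into $parts chunks without breaking lines.
  split -n "l/$parts" "$workdir/all.list" "$workdir/chunk."
  for list in "$workdir"/chunk.*; do
    # --files-from reads paths relative to the source directory.
    rsync -a --files-from="$list" "$src"/ "$dest"/ &
  done
  # Wait for every background rsync; individual exit codes are not checked here.
  wait
  rm -rf "$workdir"
}
```

Because the list is split by line count rather than by file size, the chunks can be uneven: a chunk that happens to hold the largest files will finish last and set the total run time.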

Resilio Platform: The Best Solution for Fast, Scalable Synchronization

Rsync may be a free tool. But the time spent trying to get Rsync to do what you want (researching Rsync scripts, applying and testing those scripts, figuring out what went wrong, and resolving scripting issues) comes at the cost of man hours and frustration. If you’re syncing large files and/or large numbers of files (and especially if you’re syncing to many endpoints), then Rsync is simply not worth the trouble.

Resilio Platform is a real-time, file synchronization software system that you can use to quickly sync files of any size, type, and number to any number of endpoints concurrently. It uses a P2P replication architecture — where every endpoint in your system can engage in the sync process simultaneously — to sync files 3-10x faster than legacy sync solutions, sync in any direction (one-way, two-way, one-to-many, many-to-one, and N-way), and eliminate single points of failure.

Resilio utilizes a proprietary WAN optimization protocol to enhance transfer over any network, including high-latency WANs, consumer-grade networks, and edge networks.

In this section, we’ll discuss the features and capabilities that make Resilio Platform a superior solution for quickly synchronizing large files and environments, such as:

  • High-performance, real-time sync: The speed and scalability of P2P replication

  • WAN optimization: Fast, reliable transfer over any network

  • Versatility: Supports any IT infrastructure

  • Centralized management: Granular control over your entire sync environment from a single location

  • Native security: Bulletproof protection of your data

High-Performance, Real-Time Sync: The Speed and Scalability of P2P Replication

Both Resilio Platform and Rsync engage in differential sync — i.e., syncing only the changed portions of files.

But Rsync uses a point-to-point replication architecture. In point-to-point replication, files can only be transferred between two devices at a time, typically in one of the following models:

  • Hub-and-spoke: This model consists of a hub server and remote devices. The remote devices can’t communicate with each other. All file transfers must first go to the hub server, which then synchronizes those files with the remote devices one by one.

  • Follow-the-sun: In this model, each device syncs with another device sequentially — i.e., Device A syncs with Device B; then Device B syncs with Device C; and so forth.

Point-to-point sync is slowed by the fact that file transfers are limited to two devices at a time. Achieving full synchronization of your environment will take a long time when syncing large files or large numbers of files across many devices. It also introduces single points of failure. If any one of the devices or networks goes down, it will impede the sync process, as every other device must wait to receive its files.

Resilio Platform’s peer-to-peer sync architecture overcomes all of these limitations. In a P2P topology, every device can share files with and receive files from every other device. When syncing files, every device in your application can work together, allowing you to utilize the full bandwidth of your sync environment and achieve:

A. Blazing Fast Replication Speeds (100+ Gbps per server)

When synchronizing files, Resilio Platform uses a process known as file chunking to split files into several chunks. Each chunk can be transferred independently from the others. As soon as it receives a file chunk, each device can immediately share that chunk even before it receives the rest of the file.

Imagine you want to sync a file across five devices. Resilio Platform will split that file into five chunks. Device 1 can share the first chunk with Device 2. Device 2 can immediately share that first chunk with Device 3 while it waits to receive the other four chunks. With every device working together, you can sync your entire environment 3-10x faster than with traditional solutions like Rsync.

GIF representing P2P vs client server models.

Unlike Rsync, Resilio Platform can also sync files in real time. Resilio utilizes optimized checksum calculations (identification markers assigned to each file that change whenever the file changes) and notification events from the host operating system to immediately detect and transfer file changes. Resilio can also perform manual or scheduled syncs.

B. Sync Files in Any Direction

While it’s possible to approximate a two-way sync by running Rsync twice, once in each direction, Rsync is a tool designed for one-way synchronization. This limits Rsync’s utility for certain use cases, such as disaster recovery.
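
The two-pass workaround can be sketched as follows. The function name two_way_sync is a placeholder, and the approach is only an approximation: with -u (--update), the newer copy of a file wins based on modification time, and deletions are never propagated in either direction.

```shell
# Approximate a two-way sync with two one-way rsync passes.
# two_way_sync is a hypothetical helper name for this sketch.
two_way_sync() {
  a=$1
  b=$2
  # -u skips files that are newer on the receiving side,
  # so each pass only copies files the other side lacks or has older.
  rsync -au "$a"/ "$b"/ &&
    rsync -au "$b"/ "$a"/
}
```

If the same file is edited on both sides between runs, the older edit is silently overwritten, which is one reason this pattern does not scale to real multi-directional sync.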

But Resilio Platform can sync in any direction, such as one-way, two-way, one-to-many, many-to-one, and N-way sync.

N-way sync is especially useful for use cases that require fast synchronization of many endpoints or high availability.

In remote and distributed workforce scenarios, employees can collaborate on the same files from anywhere in the world. Any changes they make to files on their end can immediately sync to every other server/office/workstation, so everyone has access to the most up-to-date versions of files.

In disaster recovery scenarios, you can achieve Active-Active High Availability. Every device in your system can effectively act as a backup server and provide the necessary files or services to your application. In the event of a disaster, every device can work together to bring your application back online — enabling Resilio Platform to meet sub-five-second RPOs (Recovery Point Objectives) and RTOs (Recovery Time Objectives) within minutes of an outage. And with real-time sync, all files and file changes can be backed up immediately.

Hot/Live DR: Multi-site Active/Active; Warm DR: Active/Active; Cold DR: Active/Passive; Offsite Copy: Backup Copy

C. Organically Scale Your Environment

As described earlier, Rsync works well in simple environments. But, as your files and file directories become larger and you add more endpoints to your environment, Rsync becomes too slow and unmanageable to be a useful sync solution. File transfers will take too long, and monitoring/fixing Rsync scripts across a large number of devices isn’t feasible.

Resilio Platform was designed to excel with large sync deployments. It can quickly sync files of any size and number (we’ve tested and successfully synced 450+ million files in a single job).

In a P2P environment, every device can communicate with every other device. So adding more devices only increases the available bandwidth, speed, and resources of your application. More demand yields more supply, and your system scales organically (i.e., there’s no need to invest lots of money in more hardware and failover architectures).

Resilio Platform can easily scale to support:

  • Sync environments of any size: Sync hundreds or thousands of systems simultaneously in the same time most legacy solutions take to sync two.

  • Sync files of any size, type, or number: We’ve tested and successfully synchronized 450+ million files in a single job.

  • Horizontal scale-out replication: Utilize the full bandwidth of your environment to achieve replication speeds of 100+ Gbps per server.

Resilio can sync 50% faster than traditional sync solutions in a 1:2 replication scenario, and 500% faster in a 1:10 scenario.

Plus, our team is always working on making Resilio Platform as efficient as possible. In a recent update, we reduced the average memory footprint required on replication jobs by 80% by optimizing time, merging, CPU usage, indexing, storage I/O, and end-to-end transport.

D. No Single Point-of-Failure

Since every device can receive and transfer files, there is no single point of failure. If any network or device goes down, the necessary files or services can be retrieved from any other device in your environment. Resilio can also perform dynamic rerouting to route around outages and downed networks, so you can always perform backups and reliably deliver files to their destination.

Traditional backup vs Resilio backup.

WAN Optimization: Fast, Reliable Transfer over Any Network

Rsync performs well when synchronizing over LANs and with networks that have good bandwidth. But the transfer protocol it uses doesn’t perform well over WANs. And Rsync is limited by the quality of the network you’re using.

Resilio Platform uses a proprietary WAN acceleration protocol known as Zero Gravity Transport™ (ZGT). ZGT optimizes transfers and enables you to fully utilize any network, regardless of latency, packet loss, and network quality.

ZGT optimizes transfers using:

  • Congestion control: ZGT uses a congestion control algorithm that constantly probes the RTT (Round Trip Time) in order to identify and maintain an ideal data packet send rate.

  • Bulk data transfer: The sending device maintains a uniform packet distribution over time by sending packets periodically with a fixed packet delay.

  • Interval acknowledgments: While other protocols require the receiving device to acknowledge each packet, ZGT uses interval acknowledgments — sending acknowledgments for groups of packets periodically in order to increase transfer speeds.

  • Delayed retransmission: Each acknowledgment contains information about lost packets. ZGT retransmits lost packets in groups to decrease unnecessary retransmissions and reduce network congestion.

  • Checksum restarts: If a file transfer is interrupted, Resilio can perform a checksum restart to resume the file transfer where it left off once back online, rather than restarting the transfer and transmitting redundant data.

  • Dynamic rerouting: If any network or device in your environment goes down, Resilio can dynamically reroute around the outage and find the fastest path to the receiving device — so your data is always delivered to its destination as quickly as possible.

Resilio Platform vs Other WAN Optimizers

By overcoming packet loss and latency, moving data in parallel, and reducing the amount of data sent over the wire, Resilio utilizes 100% of the available bandwidth and increases the overall throughput 2-10x compared to other products.

ZGT also enables Resilio Platform to deliver superior edge sync performance. It works with VSAT, cell (3G, 4G, 5G), Wi-Fi, and any IP connection. It seamlessly adjusts to low-capacity links and provides optional data compression methods for metered connections in order to optimize data transfers at the edge of networks — so you can:

  • Synchronize branch offices located in areas with little/no network coverage.

  • Deploy system updates to a fleet of geographically distributed vehicles and vessels.

  • Collect data from geographically distributed vehicles or remote sites.

  • Quickly distribute time-sensitive operational data to your fleet of vehicles or remote sites.

Versatility: Supports Any IT Infrastructure

Rsync is a solution designed for Linux and Unix-like operating systems. While you can use it on Windows devices, you must install it in a specific way (by, for example, installing Windows Subsystem for Linux).

Resilio Platform is designed with universal standards and open protocols, so it works with:

  • Any device: You can use Resilio with desktops, laptops, mobile devices (Resilio offers an iOS and Android app), and most NAS devices.

  • Most popular operating systems: Resilio Platform works with Windows, Linux, Mac, Unix, Ubuntu, FreeBSD, OpenBSD, and more.

  • Any cloud storage provider: Resilio Platform works with any cloud object storage platform, such as AWS, Google Cloud Platform, Azure Blobs, Cloudian, Ceph, MinIO, VAST Data, Wasabi, Weka IO, and more.

  • Virtual machines: You can use Resilio with Citrix, VMware, and other hypervisors and virtual machines.

Due to Resilio’s flexible BYO storage, you can install Resilio on your existing IT infrastructure (without the need to invest in extra/proprietary hardware or software) and begin replicating in as little as 2 hours.

Centralized Management & Automation: Granular Control over Your Entire Sync Environment from a Single Location

Even in small, simple environments, managing Rsync can be a chore. But, in large environments with many endpoints, Rsync is unmanageable. You must create and monitor Rsync jobs on each endpoint individually (i.e., no centralized way to oversee each device). And, if something goes wrong with Rsync on any device, you must comb through lots of code to find the problem, then find and deploy the right solution.

Resilio Platform is an agent-based file synchronization software system. After simply installing Resilio agents on each device, you can manage every device and every aspect of the sync process from Resilio’s Management Console:

  • You can control each endpoint in your sync environment from one location, and automate how syncs occur at each endpoint.

  • You can create bandwidth utilization policies for each device, and even create profiles that govern how much bandwidth that device can use at certain times of the day and on certain days of the week.

  • You can adjust replication parameters (such as data hashing, file priorities, buffer size, packet size, disk I/O threads, and more) to optimize performance and resource utilization.

  • Script any type of functionality your job requires and get real-time replication reports using Resilio’s REST API.

Resilio Platform Overview, General Info, Statistics

Native Security: Bulletproof Protection of Your Data

Although you can use Rsync over SSH (Secure Shell) for encrypted transfers, Rsync offers no native security features and leaves your data vulnerable to corruption and interception.
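
For reference, an Rsync-over-SSH transfer looks roughly like the sketch below. The function name push_over_ssh and the host in the usage note are placeholders; the encryption and authentication come entirely from SSH, not from Rsync itself.

```shell
# Push a directory to a remote host through an SSH tunnel.
# push_over_ssh is a hypothetical helper name for this sketch.
push_over_ssh() {
  src=$1
  remote=$2   # in the form user@host:/path/
  # -a: archive mode, -z: compress data in transit,
  # -e ssh: use ssh as the remote shell (stated explicitly for clarity)
  rsync -az -e ssh "$src" "$remote"
}
```

A call such as push_over_ssh /data/src/ backup@example.com:/data/dest/ relies on SSH keys or passwords already being set up between the two machines.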

Resilio Platform includes built-in security features that provide bulletproof protection of your data and eliminate the need to invest in third-party security solutions and VPNs.

Resilio’s state-of-the-art security features were reviewed by 3rd-party security experts. These include:

  • Cryptographic data integrity validation: Resilio uses data validation to ensure your files arrive at their destination uncorrupted.

  • AES-256 encryption: Resilio encrypts data at rest and in transit using AES-256.

  • Mutual authentication: Before initiating a transfer, each endpoint must provide an authentication key. This ensures your data is only delivered to approved devices.

  • Forward secrecy: Resilio protects your sessions using one-time session encryption keys.

  • File permission controls: Resilio gives you granular control over who can access specific files and folders.

Use Resilio Platform for Fast Transfer & Large Sync Jobs

Resilio Platform is a file synchronization software system that provides blazing-fast synchronization of large datasets through:

  • P2P synchronization: Resilio enables you to utilize the full bandwidth of your environment, sync in any direction, scale organically, and eliminate single points of failure.

  • WAN acceleration technology: Resilio provides fast, reliable transfer over any network (no matter how unreliable) — including edge networks in areas with little to no coverage.

  • Versatile deployment: Resilio can be deployed flexibly on any device, operating system, cloud storage provider, and virtual machine.

  • Centralized management and automation: Resilio’s Management Console provides granular control over every aspect of replication from a single, unified location.

  • Built-in security: Resilio includes native security features that keep your data safe and maintain data integrity.

For a full demonstration of how Resilio’s capabilities can optimize your application, schedule a demo.
