Introducing Scale-out File Replication and Sync

Konstantin

Resilio’s new high-performance file replication technology is ready for a test drive on your fastest network

Resilio is excited to announce a breakthrough in high-performance “scale-out” data transfer and synchronization that helps data-intensive organizations move massive files, and many millions of files, of any size and type as fast as their fastest networks allow. The innovation enables more data to be moved in parallel across any type of network and over any distance, all while maintaining high availability and eliminating single points of failure (SPOFs) during the transfer process.

In recent tests, we transferred a 1TB file across Azure regions within 90 seconds!

Background

As network speeds increase and data volumes continue to explode, file transfer and replication speeds remain top of mind for many IT professionals. Today, we’re excited to share news about a major breakthrough in commercial high-speed data movement for ingest, distribution, and synchronization: Resilio Platform scale-out file replication.

Scale-out, or horizontal scalability, is the ability to increase performance by adding nodes to a cluster. The idea is to pool the resources of many nodes (network, CPU, and storage I/O) so that speed increases linearly as nodes are added.
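To make “linear” concrete, here is a minimal Python sketch of an idealized scale-out model: aggregate throughput is the node count times per-node throughput, capped by the network link. The function name and the example figures (5 Gbps per node, a 100 Gbps link) are illustrative assumptions, not Resilio parameters, and a real cluster also loses some efficiency to coordination overhead.

```python
def aggregate_throughput_gbps(nodes: int, per_node_gbps: float, link_gbps: float) -> float:
    """Idealized scale-out: throughput grows linearly with node count
    until the shared network link becomes the bottleneck."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return min(nodes * per_node_gbps, link_gbps)


if __name__ == "__main__":
    # Hypothetical cluster: 5 Gbps per node on a 100 Gbps link.
    for n in (1, 5, 10, 20, 30):
        print(n, "nodes ->", aggregate_throughput_gbps(n, 5.0, 100.0), "Gbps")
```

In this model, a single endpoint is the limit until enough nodes are added to saturate the link; past that point the network becomes the bottleneck again.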

In a Resilio scale-out peer-to-peer file replication architecture, each node in the environment runs a Resilio Platform Agent. This cluster of nodes (known as a swarm in P2P vernacular) works together to transfer or synchronize data.

Why Did Resilio Develop Scale-Out? 

Over the past few years, Resilio has been focused on end-to-end performance optimization, especially around speeding up the synchronization process for larger data sets. 

Resilio Platform 3.0, released last year, was all about scale. Our engineering team improved memory efficiency by 85% over earlier releases. Reading, indexing, and merging differences across hundreds of millions of files in parallel is complex, to say the least, and Resilio validated synchronizing 400 million files with a single agent.

A fundamental barrier to faster performance is the physical host system: the endpoint running a Resilio agent. No matter how much you scale up by adding memory, CPU, throughput, and storage IO, an individual host system (the endpoint) will only go so fast. Depending on the workload, the endpoint can become the bottleneck for transfer and synchronization performance.

For example, a modern x64 Linux system (virtual or physical) does a good job scaling memory, CPU cores, IO, and throughput. Even so, with a traditional point-to-point tool like rsync or Aspera Sync, you’re never going to go beyond what that single system can deliver in raw throughput. Parallel rsync and tools like GridFTP let you run multiple one-way transfer and sync jobs, but the end-to-end performance of any individual transfer or sync process is always limited to two computer endpoints.

As networks become faster (100 Gbps and beyond) and unstructured data sets grow exponentially larger, the bottleneck shifts away from the network and onto the file replication architecture. While peer-to-peer can distribute parts of the workload across multiple peers, a new paradigm was needed to make fuller use of multiple endpoints.

So…how can you achieve large-scale transfers and synchronization jobs across multiple endpoints as efficiently as possible—and within practical timeframes?   

With the development of scale-out, Resilio’s design goals were to:

  • Leverage our peer-to-peer (P2P) architecture and all of the benefits it provides (e.g., high availability, no SPOFs, infrastructure flexibility)
  • Build on recent performance optimizations in memory utilization to enable transferring and synchronizing hundreds of millions of files per job. 
  • Achieve linear scalability of data transfer performance for real-world data sets, including large files and many files of varying sizes.
  • Test performance using commodity “off the shelf” x64 systems.   

Resilio’s chief product officer, Alan Hannan, says, “Resilio’s scale-out provides a horizontally scalable method to deliver files and data between locations faster than native applications allow. As a particular data point, with 25 commodity VMs on each side, we can move a dataset at over 100Gbps. This frees up technical, business, and creative employees to work on the data instead of having to wait for the data.”

Resilio Scale-Out is Unique From Other Approaches

Most traditional replication solutions are limited by point-to-point (client/server) architectures, meaning a transfer or synchronization job runs between only two computer systems at once. Organizations in high-performance computing (HPC) and other industries that need faster transfers and sync jobs have found some tricky workarounds: GridFTP, UDP blasters, and even parallelized rsync can be scripted to “follow the sun” and move multiple files in parallel by brute force. But there is always a one-to-one relationship between an endpoint and a transfer job.

Other approaches to “scale-out,” such as running Aspera on an EMC Isilon cluster or Signiant on multiple high-speed nodes, don’t help synchronize an individual file any faster, and they also limit the use case to a one-way sync like rsync or, at best, two-way sync with caveats. These point-to-point architectures require manual partitioning of the data sets to spread the load across multiple nodes. Resilio, by contrast, distributes the load across multiple nodes automatically.
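For contrast, the manual partitioning that point-to-point tools leave to the operator often looks like the following Python sketch: a greedy “largest file first” split of a file list into one bucket per worker, with each bucket then run as its own rsync or GridFTP job. The function name and file sizes are hypothetical. A single file can never be split across buckets, which is why this approach cannot move one massive file faster than a single pair of endpoints.

```python
import heapq


def partition_by_size(files: dict[str, int], workers: int) -> list[list[str]]:
    """Greedy largest-first split of files across workers by total bytes.

    Each returned bucket would become a separate point-to-point job.
    """
    heap = [(0, i) for i in range(workers)]  # (bytes assigned, worker index)
    buckets: list[list[str]] = [[] for _ in range(workers)]
    for name, size in sorted(files.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)  # least-loaded worker so far
        buckets[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return buckets


if __name__ == "__main__":
    sizes = {"a.bin": 900, "b.bin": 500, "c.bin": 400, "d.bin": 100}
    print(partition_by_size(sizes, 2))
```

The operator has to rerun and rebalance this split whenever the data set changes, which is the bookkeeping an automatic approach removes.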

Thus, traditional point-to-point approaches to “scale-out” don’t work well for workloads such as: 

  • Transferring and synchronizing a single massive file faster than a single system  
  • Transferring and synchronizing many files to many locations in parallel 
  • Efficiently transferring and synchronizing files at higher network speeds without sending unnecessary or redundant file copies across the network
  • Obtaining full predictability of multi-node transfers and synchronization jobs across networks of varying qualities and speeds. 

Resilio scale-out makes full use of higher-speed interconnects, from 10 to 100 Gbps and beyond. It will give data-intensive organizations, from multiple HPC facilities to microscopy centers to commercial biotech companies, the ability to move massive data sets across any type of network at unprecedented, predictable speeds.

Scalability Advantages of the Resilio Platform

Resilio has developed a scale-out file replication architecture where multiple nodes running industry-standard operating systems (such as Windows and Linux) and a Resilio Agent are able to replicate a single file or multiple files from a common source to a target destination in parallel. 

This offers tremendous flexibility in scaling out as needed, based on the target throughput or the time to completion of an ingest, distribution, or synchronization job. Resilio scale-out uses intelligent node groupings and file chunking to optimize replication performance, resulting in faster transfers and reduced latency. Files are transferred and replicated as quickly and efficiently as possible while ensuring data integrity and security, over any type of IP network.
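Resilio has not published its chunking or node-grouping algorithm, so the following is only a minimal Python sketch of the general idea, assuming fixed-size chunks and simple round-robin assignment: a file is divided into byte ranges, and every node in a group receives a share of those ranges, so several nodes work on one file at once. Chunk, split_into_chunks, and assign_round_robin are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int
    offset: int  # byte offset within the file
    length: int  # number of bytes in this chunk


def split_into_chunks(file_size: int, chunk_size: int) -> list[Chunk]:
    """Divide a file into fixed-size byte ranges; the last one may be short."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        Chunk(i, off, min(chunk_size, file_size - off))
        for i, off in enumerate(range(0, file_size, chunk_size))
    ]


def assign_round_robin(chunks: list[Chunk], nodes: list[str]) -> dict[str, list[Chunk]]:
    """Spread chunks evenly so every node in the group moves part of the file."""
    plan: dict[str, list[Chunk]] = {n: [] for n in nodes}
    for chunk in chunks:
        plan[nodes[chunk.index % len(nodes)]].append(chunk)
    return plan
```

With a 10-byte file, 4-byte chunks, and nodes n1 and n2, n1 gets chunks 0 and 2 while n2 gets chunk 1. A production system would use much larger chunks and would likely weigh each node’s speed rather than just counting chunks.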

Some of the core Resilio Platform capabilities complementing scale-out include:

  • Use any type of storage (file, block, or object); the faster the better.
  • Use industry-standard x64 systems running your choice of OS, physical or virtual (VM). Whatever your per-node performance, Resilio makes it easy to add nodes to increase performance.
  • Payloads can be of any size and type, from a single very large file to many small and medium-sized files. Our results show sustained performance for both single very large files and multiple smaller files.
  • Automation: Scaling out and back can be fully automated through the user interface, via scripting, or through a well-documented API set.

Other capabilities that also benefit scale-out include: 

  • P2P architecture: All nodes work collectively and automatically to distribute the payload as fast as possible given the total number of nodes and available bandwidth. 
  • WAN optimization: Resilio overcomes latency and packet loss over distance through a proprietary UDP-based Zero Gravity Transport (ZGT). 
  • No single point of failure: If a node fails during a transfer, another node picks up its work.
  • Centralized management: With Resilio, organizations manage their entire file replication and synchronization infrastructure from a single console. This makes it easy to monitor performance, track progress, see what data has been transferred, and troubleshoot issues if they arise.
  • Built-in security: Resilio Platform uses advanced encryption and security protocols to ensure that data is transferred and replicated securely. The solution can be air gapped, with no reliance on third-party security services. This helps protect sensitive data from unauthorized access or interception and keeps data secure throughout the replication and synchronization process.
  • Cross-platform support: Resilio Platform supports a wide range of platforms, including Windows, Mac, Linux, and NAS devices. This makes it easy to transfer and synchronize files between different endpoints, regardless of the platform they are running on.  
  • Flexible deployment options: Resilio can be deployed at the edge, on-premises, in the cloud, or in a hybrid environment. This gives businesses the flexibility to choose the deployment model that best suits their needs, and to scale their infrastructure as their requirements evolve.  
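The “no single point of failure” capability can be illustrated with a small Python sketch, assuming a plan that maps each node to the chunk IDs it still has to move: when a node drops out, its unfinished chunks are handed to the surviving nodes, least-loaded first. This is a generic illustration, not Resilio’s recovery logic, and reassign_failed is a hypothetical name.

```python
def reassign_failed(plan: dict[str, list[int]], failed: str) -> dict[str, list[int]]:
    """Move a failed node's unfinished chunk IDs onto the surviving nodes.

    plan maps node name -> chunk IDs that node still has to transfer.
    Each orphaned chunk goes to the currently least-loaded survivor
    (ties broken by node name), keeping the remaining work balanced.
    """
    survivors = {node: list(ids) for node, ids in plan.items() if node != failed}
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the transfer")
    for chunk_id in plan.get(failed, []):
        target = min(survivors, key=lambda n: (len(survivors[n]), n))
        survivors[target].append(chunk_id)
    return survivors


if __name__ == "__main__":
    plan = {"a": [0, 3], "b": [1], "c": [2, 4, 5]}
    print(reassign_failed(plan, "c"))  # {'a': [0, 3, 4], 'b': [1, 2, 5]}
```

In the example, node c fails while holding chunks 2, 4, and 5; node b (the least loaded) takes chunk 2, and the rest alternate so both survivors end with three chunks.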

How Does it Work? 

Resilio Platform is a software-only, P2P solution for replicating, transferring, and synchronizing files across any IP network. A central management console offers a single pane of glass for managing remote Resilio agents. Each agent is installed on a commodity system. In our testing, we used x64 Linux VMs in Azure, but your computing resources could be any type of server (physical or virtual) with almost any type of storage. The faster the storage, the better each individual node will perform, and the better the overall system will perform. Each node can run Windows, Linux, macOS Server, or FreeBSD, as well as a few NAS-specific operating systems. In widely distributed deployments, customers can also use a variety of desktops and other devices, including cloud endpoints.

With scale-out, the collective power of multiple agents is intelligently and automatically pooled. Combined with the core mesh architecture of a Resilio Platform deployment, file transfer performance scales as more agents are added.

Traditional point-to-point systems like GridFTP and parallel rsync, by contrast, are limited to scaling up between at most two computer systems at once. P2P transfers break this two-system limitation by enabling transfers among multiple systems concurrently, yet there is still only one peer writing to a file at a given time. And while some storage systems, such as EMC Isilon, have scale-out architectures, Isilon’s replication system, SyncIQ, is unidirectional: like rsync, it can only sync files in one direction between at most two clusters (or in a hub-and-spoke arrangement).

Resilio’s implementation of scale-out enables multiple loosely coupled peers running Resilio agents (like nodes in a cluster) to cooperate and read from or write to one or more files concurrently. This breaks the bottleneck of a single node performing all read/write operations on a file. Resilio is also multi-directional: you can replicate from a single system to multiple systems, or from many systems to many systems, far faster than these traditional point-to-point technologies allow.
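The basic mechanism that lets several writers cooperate on one file can be shown in a minimal Python sketch: the target is preallocated to its final size, and each writer seeks to its own non-overlapping offset and writes only its byte range. Here threads stand in for agents and a local temporary file stands in for shared storage; both are assumptions for illustration, and parallel_assemble is a hypothetical name, not part of Resilio.

```python
import os
import tempfile
import threading


def write_range(path: str, offset: int, data: bytes) -> None:
    """One writer fills in its own byte range of the shared file."""
    with open(path, "r+b") as f:  # separate file handle per writer
        f.seek(offset)
        f.write(data)


def parallel_assemble(path: str, parts: list[bytes]) -> None:
    """Preallocate the target file, then let one writer per part fill it in."""
    total = sum(len(p) for p in parts)
    with open(path, "wb") as f:
        f.truncate(total)  # reserve the final size up front
    threads, offset = [], 0
    for part in parts:
        t = threading.Thread(target=write_range, args=(path, offset, part))
        threads.append(t)
        t.start()
        offset += len(part)
    for t in threads:
        t.join()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "big.bin")
        parallel_assemble(target, [b"AAAA", b"BBBB", b"CC"])
        with open(target, "rb") as f:
            print(f.read())  # b'AAAABBBBCC'
```

Because the byte ranges never overlap, the writers need no locking on the file contents; correctness depends only on every range being assigned exactly once.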

Scale-out is made possible by advanced P2P algorithms and data hashing techniques. Unlike traditional P2P, where a single peer reads and writes a given file, scale-out introduces groups in which multiple agents work in parallel to transfer and sync files. Combined with Resilio’s UDP-based protocol, ZGT, the system also overcomes latency and packet loss. While we have tested up to 100 Gbps, there is no design limit on how fast it can go.
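The post does not specify which hashing techniques scale-out uses. One well-known way for peers to agree on chunk ownership without a central coordinator is rendezvous (highest-random-weight) hashing, sketched below in Python: each agent hashes every (agent, chunk) pair and the highest score wins, so all agents with the same member list compute the same owner independently. This is a swapped-in technique for illustration, not a description of Resilio’s algorithm, and owner is a hypothetical name.

```python
import hashlib


def owner(chunk_id: str, agents: list[str]) -> str:
    """Rendezvous (highest-random-weight) hashing.

    Each agent scores the chunk by hashing "agent:chunk"; the highest
    score wins. Every agent evaluating this gets the same answer.
    """
    if not agents:
        raise ValueError("agent list is empty")

    def score(agent: str) -> int:
        digest = hashlib.sha256(f"{agent}:{chunk_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(agents, key=score)


if __name__ == "__main__":
    agents = ["agent-1", "agent-2", "agent-3"]
    for cid in ("file.bin#0", "file.bin#1", "file.bin#2"):
        print(cid, "->", owner(cid, agents))
```

A useful property for failure handling: if an agent leaves the group, only the chunks it owned move to other agents; every other chunk keeps its owner.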

This is particularly useful for organizations that need to transfer data between multiple sites on a regular basis. It’s easy to set up automated workflows that replicate and synchronize files as soon as they are created or updated. All of this happens in the background and doesn’t require user intervention.
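As an illustration of detecting files “as soon as they are created or updated,” here is a minimal polling sketch in Python that snapshots a directory tree as (size, modification time) per file and reports which paths are new or changed between two snapshots. It is an assumption about one simple way to detect changes, not a description of how Resilio agents do it; snapshot and changed_files are hypothetical names.

```python
import os


def snapshot(root: str) -> dict[str, tuple[int, int]]:
    """Map each file's path (relative to root) to (size, mtime in ns)."""
    state: dict[str, tuple[int, int]] = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            state[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return state


def changed_files(old: dict, new: dict) -> list[str]:
    """Paths created or modified between two snapshots, sorted."""
    return sorted(path for path, sig in new.items() if old.get(path) != sig)
```

A workflow would take a snapshot, wait, take another, and queue the changed paths for replication. Deletions would need a separate check for paths present in the old snapshot but missing from the new one.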

Another major benefit of Resilio is the ability to keep all of these distributed data sets rapidly and efficiently in sync. Unlike GridFTP and parallel rsync, Resilio Platform syncs very well, and at scale. It’s also easy to deploy and use, provides end-to-end resilience and file integrity, and scales performance on demand. Moreover, through multidirectional sync (versus simple one-way rsync), Resilio keeps files current across as many locations as needed.

Scale-Out Internal Test Results 

Over the past month, we have been testing scale-out for Resilio Platform in a number of scenarios. The highlight result was moving a 1TB data set in about 90 seconds! 
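For a sense of scale, the arithmetic behind the 1TB result is shown below in Python. It assumes “1TB” means 10^12 bytes (decimal); if it means 2^40 bytes (1 TiB), the average rate is somewhat higher. effective_gbps is a hypothetical helper name.

```python
def effective_gbps(num_bytes: int, seconds: float) -> float:
    """Average transfer rate in gigabits per second (10**9 bits/s)."""
    return num_bytes * 8 / seconds / 1e9


if __name__ == "__main__":
    print(f"{effective_gbps(10**12, 90):.1f} Gbps")  # 88.9 Gbps
    print(f"{effective_gbps(2**40, 90):.1f} Gbps")   # 97.7 Gbps
```

In other words, moving 1TB in about 90 seconds means averaging roughly 89 to 98 Gbps for the whole run, depending on which definition of a terabyte applies.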

Other results thus far include intra-cloud testing on Azure. The test system scaled performance linearly up to 32 nodes. The results also show that payloads can be massive and of any size and type: nodes in a cluster work together to transfer single large files, and payloads may contain multiple files of varying sizes, up to hundreds of millions of files in a single job. Replication and sync performance scales linearly as individual nodes are added to the job.

A number of performance tests were run in Microsoft Azure, with between 10 and 50 Resilio agents included in each job. Each Resilio Agent ran on a commodity Linux VM (node) capable of 5-6 Gbps. Each VM ran Ubuntu 22.04 with 16 cores and 32 GB of RAM.

Here’s a single pane of glass view of the Resilio Platform management console, used to manage all of the Resilio agents—which are added to individual job runs. 

Here’s the result for distributing a payload containing ten 100 GB files:

In another run with a similar payload, when the available bandwidth was raised to a maximum of 125 Gbps, the job reached a peak speed of 120.7 Gbps.
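The figures in this section support a quick back-of-the-envelope check, sketched in Python below: link utilization for the 120.7 Gbps run, and the minimum number of 5 to 6 Gbps nodes that could sustain that rate. The node count for that run is not stated, so the second number is only a lower bound under the assumption that each node stayed within its 5 to 6 Gbps range; the helper names are hypothetical.

```python
import math


def link_utilization(achieved_gbps: float, available_gbps: float) -> float:
    """Fraction of the available bandwidth actually used."""
    return achieved_gbps / available_gbps


def min_nodes(target_gbps: float, per_node_gbps: float) -> int:
    """Fewest nodes that could sustain target_gbps at per_node_gbps each."""
    return math.ceil(target_gbps / per_node_gbps)


if __name__ == "__main__":
    print(f"Utilization: {link_utilization(120.7, 125):.1%}")  # Utilization: 96.6%
    # Per-node range of 5-6 Gbps, as stated for the Azure VMs.
    print(min_nodes(120.7, 6.0), min_nodes(120.7, 5.0))  # 21 25
```

So the run used about 96.6% of the available bandwidth and needed at least 21 to 25 such nodes, which falls within the 10 to 50 agent range tested.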

Multi-Cloud Over Distance

One of the advantages of Resilio Platform is the ability to transfer files anywhere, using your cloud provider or IT infrastructure of choice. Distance between sites usually brings some amount of latency and packet loss as well. Resilio overcomes latency and loss to make full use of the network bandwidth between sites. Any job in Resilio Platform can be set to use our UDP-based WAN optimization technology, Zero Gravity Transport (ZGT), which maintains high performance across long-distance networks such as WANs.
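ZGT’s internals are not public, so the sketch below does not model it. Instead, it shows in Python why distance is hard for conventional transports: the bandwidth-delay product (how much data must be in flight to keep a link full) and the Mathis et al. approximation for the throughput ceiling of a single Reno-style TCP flow under packet loss. The 250 ms round-trip time, 0.01% loss rate, and 1460-byte segment size are assumed example values, not measurements from the London-to-Australia tests.

```python
import math


def bdp_bytes(bandwidth_gbps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    return bandwidth_gbps * 1e9 * rtt_s / 8


def mathis_limit_mbps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Mathis et al. approximation for one Reno-style TCP flow:
    rate <= (MSS / RTT) * sqrt(3/2) / sqrt(p)
    """
    bits_per_second = (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)
    return bits_per_second / 1e6


if __name__ == "__main__":
    # Assumed example path, not a measured one: 250 ms RTT, 0.01% loss.
    print(f"BDP at 100 Gbps: {bdp_bytes(100, 0.25) / 1e9:.3f} GB")  # 3.125 GB
    print(f"One TCP flow: {mathis_limit_mbps(1460, 0.25, 1e-4):.1f} Mbps")  # 5.7 Mbps
```

Under these assumptions, filling a 100 Gbps path needs about 3 GB in flight, while a single loss-limited Reno-style flow tops out near 5.7 Mbps. Modern congestion-control algorithms such as CUBIC and BBR behave differently, so treat this as an order-of-magnitude illustration of the latency and loss problem that ZGT is described as overcoming.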

As part of the scale-out testing, performance was measured between Google Cloud (GCP) regions in London and Australia. The tests show that WAN optimization is a big benefit when there’s considerable latency on the network. We’ll publish these numbers once the tests are complete.

Summary

In summary, Resilio Platform offers linear, high-performance speedup (scale-out) for file ingest, distribution, and synchronization workloads. With its advanced algorithms, automation capabilities, and range of other features, it gives commercial companies and leadership-class HPC and science facilities an easy, reliable way to transfer and synchronize large amounts of data within and across multiple sites, as fast as the network allows. Resilio Platform for scale-out file replication and synchronization is a versatile solution for a wide range of industries and use cases.

Who should test drive Resilio scale-out? Anyone who moves large, high-value data sets across multiple locations and needs to move them faster while keeping files current within and across those locations. Please get in touch and we can show you a live demo, or set you up with your own proof of concept as part of the Resilio scale-out beta program.

Overview

Scale-out file replication increases transfer performance linearly as nodes are added to a cluster. Now you can transfer massive file sets at 100 Gbps and beyond.