Is the Distributed File System Replication (DFSR) service causing you pain and frustration? Are your files not getting replicated or synchronized because they’re stuck in the DFSR backlog?
This popular but aging technology can easily turn a good day into a frustrating one. But you’re not alone. And the good news is, Resilio has a highly reliable and easy fix to your DFSR woes.
Issues with DFS replication not working properly are common: Files often sit in a “SCHEDULED” state with no clear way to begin syncing, and what happened to those files and the status of the replication is left unclear.
First and foremost, it’s difficult to diagnose and troubleshoot problems with DFSR. Customers and IT teams are forced to scour through articles, forums, and social posts to find solutions to DFS replication service issues.
In this article, we’ve compiled a list of the most common failure scenarios and ways to get insight into your DFS replication status.
We also discuss why these DFS replication issues keep happening and how we designed Resilio Connect, an alternative to DFS Replication (or DFSR), to overcome these issues and provide reliable, error-free file replication.
If you want to try replicating files with Resilio, you can get set up and begin replicating your Windows file servers in as little as 2 hours by scheduling a demo with our team.
First, Get Insight into Your DFS Replication Status and Environment
One of the biggest issues when DFSR is not working properly is the lack of insight or visibility into the state of replication in your environment. This makes it difficult to identify, diagnose, and resolve DFS replication issues, and adds stress to admins relying on DFSR to keep critical services operational.
When DFSR doesn’t seem to be working properly, your first task is to check the DFS replication status and narrow down the potential sources of error.
Here are 7 things you should check to identify potential issues (or skip these steps and fix DFS replication now with Resilio):
1. Check Your DFSR Backlog
Run the following DFSR PowerShell cmdlets:
- Get-DfsrBacklog: This cmdlet shows pending updates between two Windows-based file servers that participate in DFS Replication.
- Get-DfsrState: This cmdlet shows the current replication state of DFSR with respect to its replication group partners.
2. Run a Diagnostic Report
- Start DFS Management.
For Windows Server 2012 and later: Click Server Manager > Tools > DFS Management.
For Windows Server 2008 or 2008 R2: Click Start > Administrative Tools > DFS Management.
- Expand Replication.
- Right-click on the replication group for the namespace.
- Click “Create Diagnostic Report”.
- Choose “Next” for the remaining windows of the wizard.
- The completed report opens in a browser. Note: You can also find the report under C:\DFSReports.
3. Use the Dfsrdiag.exe Tool That Provides DFSR Status
- Execute the following command from PowerShell to install it: “Install-WindowsFeature RSAT-DFS-Mgmt-Con”
- Or, from an elevated command or PowerShell prompt, run DFSDiag /TestDFSIntegrity /DFSRoot: /Full.
- Review the results.
4. Narrow Down Potential Causes of the Error
- Ensure the server’s network interface card drivers are updated.
- Ensure that your antivirus software is aware of the replication and any necessary exclusions are set. You can also try disabling your antivirus software to see if that’s the issue.
5. Check for Bandwidth Throttling
- Start DFS Management.
For Windows Server 2012 and later: Click Server Manager > Tools > DFS Management.
For Windows Server 2008 or 2008 R2: Click Start > Administrative Tools > DFS Management.
- Expand Replication.
- Click on the replication group for the DFS namespace.
- Click on the “Connections” tab.
- Right-click the replication group member and select “Properties”.
- Make sure “Enable replication” and “RDC” are checked.
- Click the “Schedule” tab.
- Click “View Schedule”.
- Make sure that the bandwidth usage says “Full”. You can also change the bandwidth throttling to see if there is a difference.
6. Check Staging Quota
- In Server Manager, click Tools > DFS Management.
- Expand Replication.
- Click on the replication group for the namespace.
- Right-click each member of the replication group in the “Memberships” tab.
- Click the “Staging” tab.
- The default quota is 4 GB. If 4 GB is not sufficient, you can increase it.
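To judge whether 4 GB is enough, Microsoft's sizing guidance for DFSR is that the staging quota should be at least as large as the 32 largest files in the replicated folder. The following is a minimal Python sketch for estimating that figure. The function name is a hypothetical helper for illustration, not part of any DFSR tooling.

```python
import heapq
import os


def suggested_staging_quota_mb(root, largest_n=32):
    """Return the combined size, in whole MB (rounded up), of the largest_n files under root."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
            except OSError:
                # Skip files that disappear or cannot be read during the scan.
                continue
    total_bytes = sum(heapq.nlargest(largest_n, sizes))
    return -(-total_bytes // (1024 * 1024))  # ceiling division to whole MB
```

Call it with the local path of your replicated folder and compare the result with the quota shown on the “Staging” tab. Treat the number as a starting point rather than a guarantee.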
7. Check Active Directory
Try checking the connectivity in your Active Directory by opening a command or PowerShell prompt and using the following commands:
Command: $ DFSRDIAG dumpadcfg /member:SERVERNAME
This provides you with the details Active Directory has about DFS, the replication groups, and the folders it belongs to.
Command: $ DFSRDIAG pollad /member:SERVERNAME
This has the servers check in with AD. The result of this command should be: “Operation Succeeded”.
Command: $ DFSRDIAG replicationstate
This shows you what is replicating. If replication is working, you should see something like this:
- Active inbound connection: 1
- Connection GUID: BE12378E-123D-41233-1238-123412B7AFD6
- Sending member: YOURSERVERNAME
- Number of updates: 6
- Updates being processed:
[1] Update name: 83b78c9696004f7797f319bfcc314d201.jpg
[2] Update name: d1d86aa38477492680ff14ffffcc3fa61.fla
[3] Update name: b131d9dbffca4b7faa82a3bd172271a72.swf
[4] Update name: 5ac75c7ad2ae4d74931257d605205d441.swf
[5] Update name: 856d568e07644803844988dfd5aab05b1.jpg
[6] Update name: 1ebaa536c0574797a04ba5999e754aff3.swf
- Total number of inbound updates being processed: 6
- Total number of inbound updates scheduled: 0
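If you capture this output regularly, a small script can pull out the totals so you can track them over time. The Python sketch below assumes the text looks like the sample above. Real output may differ between Windows versions, and the function name is hypothetical.

```python
import re


def summarize_replication_state(text):
    """Extract the inbound totals and update names from replicationstate-style text."""
    def total(label):
        # Look for "<label>: <number>" anywhere in the text.
        match = re.search(rf"{re.escape(label)}:\s*(\d+)", text)
        return int(match.group(1)) if match else None

    return {
        "processing": total("Total number of inbound updates being processed"),
        "scheduled": total("Total number of inbound updates scheduled"),
        "updates": re.findall(r"Update name:\s*(\S+)", text),
    }
```

A “scheduled” total that keeps growing while “processing” stays at zero would be consistent with the stuck-backlog symptom described at the start of this article.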
While these methods can provide you with insight into the state of replication, narrowing down and fixing your replication issues will require some research, trial, and error.
The Most Common DFS Replication Failure Scenarios
The first place people often turn to for help diagnosing DFSR issues is a popular technical forum. But with zero visibility into your system, there’s no way for a well-meaning stranger to identify your exact issue.
The best way to find and fix your DFS replication errors is to use the steps in the previous section to check the status of your DFSR setup, and use that insight to research potential solutions. Otherwise, you may find yourself wasting countless hours trying erroneous suggestions.
However, if you get stuck, we recommend the following articles that address common DFSR issues:
- Microsoft DFSR Issues & Resolution: This article discusses the 7 most common causes of DFS replication failure — including active directory replication issues, inadequate staging quota, sharing violations of open files, a corrupted DFSR database, unexpected dirty database shutdowns, conflicting data modifications, and accidental data deletion — and how to resolve them.
- DFSR no longer replicates files: This troubleshooting doc from Microsoft describes an issue where DFS no longer replicates files after restoring a virtualized snapshot. Notably, at the end, they suggest contacting Microsoft support to resolve issues.
- Microsoft DFS Issues: This resource is a compilation of forum questions and answers regarding various DFS issues.
Ultimately, however, you need to come to terms with the real DFSR issue: It’s a fundamentally unreliable replication tool that will continue to break down as your needs and replication environment grow and become more complex.
Plus, Microsoft is promoting Azure File Sync and not offering much, if any, innovation on DFSR anymore.
We discuss why in more detail below and how we designed Resilio to solve these issues in the subsequent section.
The 5 Fundamental Reasons Why DFS Replication Errors Persist (and Get Worse) in Your Environment
1. Replicating Files Over High-Latency, Long-Distance WANs (or Wide-Area Networks)
A common source of DFS replication issues occurs when you’re sending data to remote locations across high-latency connections (mobile, satellite, etc.) that have long round-trip times and a high potential for packet loss. Simply put, DFSR performs poorly over WANs or any network with any level of packet loss or latency.
DFSR uses a client-server (point-to-point) replication model that relies on TCP/IP. With client-server, there’s just one sender and one receiver.
Because it runs over TCP, DFSR treats every lost packet as a sign of network congestion and slows its transmission rate to reduce the load on the connection.
But on a WAN, packet loss might be due to a failure on an intermediate device rather than channel congestion.
So, while backing off helps senders on TCP/IP networks settle on the maximum speed they can safely use, it is the wrong response when loss on a WAN link has nothing to do with congestion.
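The cost of this loss-triggered slowdown can be estimated with the widely used Mathis approximation for steady-state TCP throughput: rate ≈ (MSS / RTT) × (1.22 / √p), where p is the packet loss rate. The Python sketch below is a back-of-the-envelope model, not a description of DFSR internals. The segment size, round-trip time, and loss rates used in the example are assumptions chosen for illustration.

```python
import math


def mathis_throughput_mbps(mss_bytes, rtt_seconds, loss_rate):
    """Approximate steady-state TCP throughput in Mbit/s using the Mathis model."""
    if rtt_seconds <= 0 or not 0 < loss_rate <= 1:
        raise ValueError("rtt_seconds must be > 0 and loss_rate must be in (0, 1]")
    # sqrt(1.5) is the constant (about 1.22) from the Mathis approximation.
    bytes_per_second = (mss_bytes / rtt_seconds) * (math.sqrt(1.5) / math.sqrt(loss_rate))
    return bytes_per_second * 8 / 1_000_000
```

With an assumed 1,460-byte segment and a 100 ms round trip, a 1% loss rate caps a single TCP stream at roughly 1.4 Mbit/s in this model, no matter how much capacity the link has.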
Another DFSR deficiency over WAN networks involves how TCP/IP protocols ensure data delivery.
With TCP/IP, the sender sends a packet to a receiver, and the receiver must send a confirmation packet back acknowledging that it received the packet. The time it takes for a packet to reach the receiver and for the acknowledgement to return is known as RTT (round-trip time).
While the RTT on a LAN (local area network) can be as low as 0.01 ms, it can be as high as 800 ms over a WAN. This significantly reduces the speed at which data is transferred, with up to 2 seconds between each new packet transfer.
Both of these issues assume DFSR can transfer over your WAN at all. Because DFSR lacks WAN acceleration (i.e., technology for optimizing WAN transfer), it can’t reliably transfer over long connections of 3,000+ miles. The long distance significantly increases travel time and packet loss to the point where using DFSR becomes untenable.
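The latency penalty can be approximated with a simple bound: a sender can have at most one window of unacknowledged data in flight per round trip, so throughput ≤ window / RTT. The Python sketch below works through that arithmetic. The 64 KiB window in the example is an assumption (the classic TCP maximum without window scaling), not a figure taken from DFSR.

```python
def window_limited_mbps(window_bytes, rtt_seconds):
    """Upper bound on single-stream throughput in Mbit/s: one window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds / 1_000_000
```

With the assumed 64 KiB window, an 800 ms round trip limits a single stream to about 0.66 Mbit/s, while a 1 ms round trip allows roughly 524 Mbit/s.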
2. Poor Scalability
Even if DFSR works as it should, real-time replication of large files and/or large numbers of files can be unbearably slow with DFSR because it:
(1) Must scan entire folders/files
To detect and replicate file changes, DFSR must scan through the entire file or folder, find the changes, then transfer them. Naturally, if it must scan through large files or millions of files, this will take a long time (even if it doesn’t just add files to your backlog without starting replication).
(2) Has no optimized checksum calculation
DFSR has no optimized way of calculating the checksum of a file. Instead, it uses an algorithm known as remote differential compression to detect changes in files and replicate only those changes. However, this process takes a long time to calculate file differences, making large file transfers even longer.
(3) Uses the client-server replication model
Even once files are scanned and changes are detected, DFSR must replicate those changes one to one: the sender server must send file changes to every other server in your system individually. The more destinations you must replicate to, the slower this process will be.
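As a rough model of this one-to-one fan-out, assume a single sender pushes the same change set to each destination in turn over one uplink. Total time then grows linearly with the number of destinations. The Python sketch below is a simplification with made-up sizes and link speeds, and it ignores protocol overhead and latency.

```python
def sequential_fanout_seconds(change_bytes, uplink_mbps, destinations):
    """Time for one sender to push the same data to each destination, one after another."""
    if uplink_mbps <= 0 or destinations < 0:
        raise ValueError("uplink_mbps must be > 0 and destinations must be >= 0")
    per_destination = change_bytes * 8 / (uplink_mbps * 1_000_000)
    return per_destination * destinations
```

In this model, a 10 GB change set over a 100 Mbit/s uplink takes 800 seconds per destination, so five destinations take 4,000 seconds.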
3. More Activity or File Changes That Need to Be Replicated
The more changes to files that DFSR needs to replicate, the worse it will perform, and the more files will queue up in the DFSR backlog.
As stated earlier, DFSR synchronization is designed to scan each folder file by file to detect changes. And each time you make a change, the process of scanning each folder has to begin again. This can take a long time, especially when you have lots of files and/or large files.
4. More File Servers That Need to Be Replicated
Most organizations need to sync files across multiple locations and servers. But DFSR’s ability to synchronize files to more than one destination is limited, which is one of the most common causes of replication failure for DFSR. And the more servers that are added, the worse it will perform.
As a client-server transfer solution, DFSR executes replication one by one to each server.
DFSR needs static IP:port pairs to establish a connection to different machines. If a machine has a new IP:port or the IP:port is not available, DFSR stops operating and needs a human to reconfigure it. There is no way to have scripting around DFSR. If you need to build workflows beyond a simple “do something after the file arrives at destination,” there is no way to do so with DFSR.
The one-to-one replication approach can also create problems if one server is far away or on a slow network, as every other server must wait until the initial transfer is complete before they can receive data.
5. Need for Active-Active High Availability
In an Active-Active High Availability scenario, you have 2 sites in different areas that are both actively serving users. If there is a failure at one site, users will be automatically redirected to the other.
The primary objectives of Active-Active HA are:
- Load-balancing (over tricky network connections and in VDI scenarios)
- Quick, accurate recovery of data (in DR scenarios)
- Fast, accurate replication of concurrent data changes
DFSR is not a good solution for Active-Active HA because:
(1) Poor reliability and scalability
DFSR may fail or not scale to support replicating many concurrent changes at once, and it is notorious for queuing up changes in a backlog and not fully syncing files.
DFSR is especially problematic in larger environments facing high user churn mainly around log-off storms. These events can create several thousand files per user all at once during a log-off event.
For example, when 1,000 users log off concurrently and their changes need to propagate immediately, you will likely overwhelm DFSR and cause it to crash, hang, or, worse, corrupt data.
(2) Slow replication
Because DFSR does not scale beyond 2 file servers, jobs must be synced between the 2 servers for replication to occur on a 3rd server. This slows replication speed even further.
How Resilio Overcomes These Challenges and Provides Reliable File Replication
WAN Acceleration & P2P Architecture
Resilio Connect includes built-in WAN acceleration, allowing you to utilize 100% of the available bandwidth in your network independent of distance, latency, or loss. This increases transfer speed and reduces the impact of packet loss.
P2P Transfer
Resilio’s N-way sync architecture enables files to be transferred and replicated across the entire network of devices.
Files are split into blocks that independently transfer to multiple destinations, which can exchange blocks between each other independently from the original sender.
This enables Resilio to leverage internet channels across all locations to dramatically increase speed. And the more endpoints are added, the faster transfer occurs. Resilio Connect will be 50% faster than one-to-one solutions in a 1:2 transfer scenario and 500% faster in a 1:10 scenario.
Try our transfer speed calculator to see how much time we can save for you.
WAN Acceleration
Using its proprietary transfer protocol, Zero Gravity Transport (ZGT™), Resilio minimizes the impact of packet loss and high latency and maximizes transfer speed across any network using:
- A bulk transfer strategy: The sender creates a uniform packet distribution over time by periodically sending packets with a fixed packet delay.
- Interval acknowledgement: Rather than sending acknowledgements for each packet, Resilio uses interval acknowledgement for a group of packets that provides information about lost packets.
- Congestion control: Resilio’s congestion control algorithm periodically probes the RTT (Round Trip Time) to calculate the ideal send rate.
- Delayed retransmission: Resilio retransmits lost packets once per RTT in order to decrease needless retransmissions.
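The interval-acknowledgement idea can be illustrated with a small sketch: instead of acknowledging every packet, the receiver summarizes a whole group of sequence numbers and reports only the gaps. The Python code below is a generic illustration of that technique, not ZGT's actual wire format, which is proprietary. The function and field names are hypothetical.

```python
def interval_ack(first_seq, last_seq, received):
    """Summarize one interval of packets as a list of missing (start, end) ranges, inclusive."""
    received = set(received)
    missing = []
    gap_start = None
    for seq in range(first_seq, last_seq + 1):
        if seq not in received:
            # Open a new gap if we are not already inside one.
            if gap_start is None:
                gap_start = seq
        elif gap_start is not None:
            # A received packet closes the current gap.
            missing.append((gap_start, seq - 1))
            gap_start = None
    if gap_start is not None:
        missing.append((gap_start, last_seq))
    return {"first": first_seq, "last": last_seq, "missing": missing}
```

A sender that receives such a summary only needs to resend the listed ranges, which is why one acknowledgement per interval can replace one acknowledgement per packet.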
Organic Scalability
Resilio overcomes these problems and is able to transfer at scale using:
(1) Optimized checksum calculations
A checksum is basically an identification marker that indicates whether a file has been changed or not. When a file changes, so does the checksum.
Unlike DFSR, Resilio uses optimized checksum calculations and real-time notification events from the host OS to detect changed files. It then replicates only the changed parts of a file to reduce the load on the network and increase transfer speed.
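To make the idea of replicating only the changed parts concrete, here is a minimal Python sketch that hashes a file in fixed-size blocks and reports which blocks differ between two versions. It is an illustration of the general technique only. The use of SHA-256, the tiny default block size, and the function names are assumptions, and Resilio's actual algorithm is not described in this article.

```python
import hashlib


def block_hashes(data, block_size=4):
    """Return the SHA-256 hex digest of each fixed-size block of data."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size=4):
    """Return indices of blocks in new that differ from, or do not exist in, old."""
    old_hashes = block_hashes(old, block_size)
    new_hashes = block_hashes(new, block_size)
    return [
        index
        for index, digest in enumerate(new_hashes)
        if index >= len(old_hashes) or old_hashes[index] != digest
    ]
```

Only the blocks at the returned indices would need to cross the network. Note that this simple fixed-offset scheme does not handle insertions that shift later data, which production tools typically address with more elaborate methods.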
(2) File chunking
Resilio uses file chunking, i.e., transferring files in small chunks. In the event of a network failure, it can perform a checksum restart to identify where the transfer ended so it can pick up where it left off — unlike DFSR, which has to start again from the beginning.
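A checksum restart of this kind can be sketched as follows: the receiver compares the chunks it already holds against the sender's per-chunk checksums and resumes at the first chunk that does not verify. This Python example is a simplified illustration under assumed names and an assumed hash (SHA-256), not Resilio's implementation.

```python
import hashlib


def chunk_digests(data, chunk_size):
    """Return the SHA-256 digest of each fixed-size chunk of data."""
    return [
        hashlib.sha256(data[offset:offset + chunk_size]).digest()
        for offset in range(0, len(data), chunk_size)
    ]


def resume_index(source_digests, partial_data, chunk_size):
    """Return the index of the first chunk that still needs to be sent."""
    for index, expected in enumerate(source_digests):
        chunk = partial_data[index * chunk_size:(index + 1) * chunk_size]
        # A missing or partially written chunk will not match its checksum.
        if hashlib.sha256(chunk).digest() != expected:
            return index
    return len(source_digests)
```

If the transfer was interrupted midway through the second chunk, `resume_index` returns 1, so the first chunk is not sent again.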
(3) Omnidirectional P2P transfer
Resilio’s omnidirectional file transfer capabilities mean large files and large numbers of files can be quickly replicated across your entire system. Once changes are detected, Server A can replicate those changes to Server B, which can start replicating those changes to other servers immediately.
Resilio Connect uses a dynamic routing approach that specifies when server A and B need to exchange data. This requires no human intervention, as both servers will use a tracker or multicast to discover the required IP:port address on the fly. Resilio Connect lets you take control over the file replication process, see its progress, and evaluate the results.
And with P2P omnidirectional file transfer and file chunking, every server can share data blocks with other servers as soon as they are received. This dramatically speeds up real-time syncing operations since:
- Several servers are transferring concurrently
- Other network channels help offload loads from a sender network channel
- Servers that are farther away can receive data from the server closest to them
And with ZGT™, Resilio is sensitive to bandwidth changes and is smart enough to avoid network congestion or use full bandwidth when possible.
In addition, data replication with Resilio isn’t just limited to Windows. It can be easily configured cross-platform on Linux, OS X, iOS, and Android.
High Availability
Resilio is perfect for Active-Active HA scenarios because it:
(1) Employs omnidirectional transfer
Omnidirectional file transfer is ideal for an Active-Active scenario, as each server can send data to and receive data from any other server, sharing the load between them.
File chunks are distributed across multiple replication endpoints in parallel. From a VDI perspective, this gives you the flexibility to replicate file changes anywhere — at any time.
This also creates faster time-to-desktop. One customer saw a 3x faster time-to-desktop for VMware DEM compared to snapshot-based storage replication.
(2) WAN optimization
Resilio can optimize data transfer over any network to ensure data transfer is as fast as possible. And users can access the servers closest to them.
It can dynamically route around failures and overcome latency.
Visibility & Centralized Management
Other tools (especially DFSR) leave you in the dark about the status of your system. Resilio’s dashboard provides real-time notifications and detailed logs that give insight into replication on your network.
Resilio also enables you to adapt key replication parameters, such as:
- Bandwidth control: You can adjust bandwidth usage based on the time of day or the day of the week, as well as create different schedules for different jobs and agent groups.
- Network stack: You can optimize performance by adjusting parameters such as buffer size, packet size, and more.
- Storage stack: You can control file priorities, data hashing, and more to meet the needs of your operation.
- Functionality: You can use Resilio’s REST API to manage agents, create groups, script functionality, or report on data transfers in real time.
Resilio’s configurability lets you optimize performance by controlling costs and resource use as well as spotting and fixing any issues.
Replicate with Resilio
DFSR is simply not a great replication solution for organizations that need to replicate large files. DFSR issues will continue to persist, create a bottleneck in your workflow, and be an endless source of headaches.
If you want faster, more available, scalable, and reliable replication that always works, try Resilio today.