See How One Enterprise IT Team Speeds up DFS Replication and Synchronization Using Resilio Connect

Introduction

At Resilio, we work with many companies that have had challenges with the DFS Replication service. Here’s the story of one company that, while wishing to remain anonymous because of company policy, also wanted to share its experience to help other IT professionals in the same situation.

One of the largest employee benefit administration companies in the US today uses Resilio Connect to speed up file synchronization and improve file access time for Microsoft’s Distributed File System (DFS). Resilio Connect gives them an easy and predictable way to scale out performance, reduce recovery times, and simplify managing and synchronizing files across their multi-site DFS namespace.

One of their biggest obstacles was reliability: keeping files in sync and accessible to users while using DFS Replication (DFSR), the successor to the File Replication Service (FRS).

Having struggled with DFSR for years, the company’s IT team turned to Resilio. According to the company’s IT manager, their top two goals were meeting SLAs for reliability and availability, and simplifying replication within Microsoft’s Distributed File System (DFS) namespace to share files across multiple sites within the organization.

Other goals for the IT team included:

  • DR enablement: Moving from active-passive to active-active high availability (HA) and continuous data availability with Resilio Connect

  • Streamlining upgrades and maintenance across DFS targets; reducing the time it takes to perform upgrades during time-sensitive maintenance windows

  • Improving overall data accessibility for all users and a variety of line-of-business applications and workflows

  • Simplifying management operations to reduce total management time, improve diagnosability, and gain visibility into enterprise-wide file services

Challenges

The company’s challenges with DFS impacted recovery times, upgrade times, and the IT team’s ability to keep multiple file systems synchronized and accessible to users.  The team spent endless hours diagnosing, troubleshooting, and sifting through diagnostic reports and event logs surrounding DFS replication.  

“The replication status of files was impossible to track.  There was a constant and growing backlog of changes that were not getting replicated,” said their IT manager.  “There was simply not enough time in the day to replicate all of our files.” 


As the backlog grew, so, too, did the frustration of the IT team. There weren’t any good monitoring or management tools provided with the DFS replication service. 

Another issue was meeting the IT team’s goals for high availability (HA). The company could not achieve more than active-passive HA for DFS. Switching between target servers and sites took a prohibitive amount of time. Replication was unpredictable: sometimes it took many hours, other times a full day or several days.

The company’s data set includes a variety of file sizes with the number of files growing into many millions. Soon it became impractical to keep all of the files synchronized across all server nodes and sites.  Moreover, due to the need to replicate files of varying sizes and types, DFSR would falter or fail when synchronizing files across the WAN; any latency or hiccups on the wire would bog down DFSR.  

Another big challenge was meeting the company’s SLAs for high availability and business continuance during planned maintenance windows. Software and hardware updates as well as periodic maintenance activities turned out to be disruptive to the company. In some cases, an entire DFS site would have to be taken offline during updates, leaving it completely unavailable to all employees in the organization.

“Implementing software updates across sites and directories was brutal using DFSR,” said the IT manager. “There was an arduous, manual failover process switching from the active to the passive site during outages. Enabling the new DFS target and disabling the previous target was a pain. When we patched one server with the DFS targets available, the server being patched would no longer be available.”

During planned or unplanned events, there was no way to replicate files with DFSR. A continuous and growing backlog of files queued up, often breaking DFSR.  This inability to replicate or sync the data impacted file data availability and recovery points.   

The IT manager said DFSR replication was so unreliable that it created “data access gaps”, where data was too far out of sync and users could not access their files for hours and sometimes days on end.  Failover operations were manual.  Enabling and disabling the DFS target took too much time.  It took at least 15 minutes for the Active Directory (AD) Domain Controllers to sync and to make the new DFS active target available to employees. 

The initial sync was a particular pain point. “With DFSR, you’re constantly waiting on sync,” said the IT manager. “If you update changes on one side of the topology while a replication job is still running, you run into a conflict deleted issue, which spirals and snowballs out of control using these legacy tools.”

Managing file permissions was also challenging. NTFS permissions sometimes did not replicate correctly.  The IT manager found that other customers faced similar issues, from managing replicated folders to keeping destination servers updated to coping with SYSVOL corruption during updates. Dfsrdiag and other command-line diagnostic tools did not address the root cause of the problem.  

The IT manager described the DFSR backlog in detail. “The best case scenario with DFSR was that the backlog was in a good state; ideally, we could have those changes available to users, but it was unpredictable and unreliable. If switching (failing over) to the new active target server in DFS didn’t work correctly, that caused a huge mess.”

“There wasn’t enough time in the day or available staff hours to do what we needed to do with DFSR. Managing replication with DFSR was a full-time job.”

Solution for DFS Replication on Windows Server

The IT team determined that Resilio Connect offered a far superior solution for file replication. Today they run 8 Windows Server 2019-based file servers in Microsoft DFS, spanning 3 sites. Active Directory remains on Windows Server 2016.

Using Resilio Connect, the team more fully utilizes their existing IT infrastructure software and hardware. Resilio Connect is low latency and speeds up synchronization—scaling to support many millions of files and any number of servers (or endpoints). Resilio uses a peer-to-peer (P2P) architecture enabling highly resilient, scale-out replication and sync across any number of locations.

Each of the 8 file servers runs in a virtual environment connected to a Pure Storage SAN. The company migrated from another storage provider to Pure Storage and used Resilio Connect to perform the data migration.

Conventional replication tools employ a point-to-point design, which serializes replication between 2 servers and is unidirectional. Resilio Connect, by contrast, hashes and replicates files across multiple servers concurrently, in topologies that can be one- or two-way, one-to-many, many-to-one, or many-to-many.
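To make the idea of hashed, concurrent replication more concrete, here is a minimal Python sketch. It is not Resilio's implementation: the chunk size, the site names, and the round-robin assignment are all assumptions chosen for illustration. It splits a payload into fixed-size chunks, hashes each chunk with SHA-256, and spreads the chunks across several peers so that a receiver could pull pieces from more than one source and verify each piece on arrival.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow (assumption)


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[tuple[int, str]]:
    """Split data into fixed-size chunks and return (index, sha256 hex) pairs."""
    return [
        (i // chunk_size, hashlib.sha256(data[i:i + chunk_size]).hexdigest())
        for i in range(0, len(data), chunk_size)
    ]


def assign_chunks(num_chunks: int, peers: list[str]) -> dict[str, list[int]]:
    """Round-robin chunk indexes across peers (a hypothetical scheduling policy)."""
    plan: dict[str, list[int]] = {peer: [] for peer in peers}
    for index in range(num_chunks):
        plan[peers[index % len(peers)]].append(index)
    return plan


def verify_chunk(chunk: bytes, expected_hash: str) -> bool:
    """A receiver checks a chunk against its published hash before accepting it."""
    return hashlib.sha256(chunk).hexdigest() == expected_hash


# Hypothetical payload and site names, for illustration only.
payload = b"quarterly-benefits-report"
hashes = chunk_hashes(payload)
plan = assign_chunks(len(hashes), ["site-a", "site-b", "site-c"])
print(plan)
```

Because every chunk carries its own hash, no single server has to deliver the whole file in sequence, which is the property that separates this style of transfer from a serialized point-to-point copy.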

Another big advantage of Resilio over DFSR is fast synchronization over any distance. Resilio Connect includes WAN optimization to obtain predictable speed and fully utilize bandwidth across any location.  Thus, the core transport in Resilio Connect efficiently scales transfer performance, in some cases up to 20x faster than conventional WAN optimizers and copy tools.  

Architecturally, Resilio Connect is an Agent-based solution.  Resilio Agents are installed on all devices participating in synchronization jobs. Resilio employs a peer-to-peer architecture where data is hashed across all Agents in the file synchronization job.   

Job types include distribution, consolidation, scripting, and synchronization. All jobs can be automated through a RESTful API or scripted. Resilio Connect agents are cross-platform and run on Windows, macOS, Linux, and other server and storage platforms. Resilio also supports popular virtualization platforms and cloud providers. 

The Resilio Connect Management Console is a centralized, web-based management system used to manage and monitor all agents and job functions through an easy-to-administer graphical user interface. The Management Console runs centrally on Windows or Linux. The company’s IT administrators are able to manage agents, create and run jobs, control speed, and deploy job instructions across the organization. Once jobs are running, administrators are able to monitor and collect statistical data globally. Optionally, Resilio offers a complete API set to expose and automate all functions.

Impact

Using Resilio Connect, the company has moved over 100 million files, amounting to over 200 TB of data. Resilio Connect enables fast, real-time synchronization: the company can synchronize 100 million files and many terabytes of data in a matter of hours rather than days. Before Resilio Connect, replicating that many files was not achievable.

In short, the company chose Resilio Connect over the competition because it:

  • Enabled active-active high availability for failover across all DFS sites for DR

  • Reduced recovery times (RTOs) from days and weeks to about 15 minutes for planned events, such as maintenance and upgrades

  • Established a recovery point objective (RPO) of 2 minutes to improve both data availability and file accessibility, down from days or weeks with DFSR to an end-to-end latency of less than 2 minutes

  • Saved 13 hours of labor per week, reducing management time from 14 hours with DFSR to less than 1 hour per week with Resilio Connect

  • Gave the IT team peace of mind, and the ability to sleep better at night, knowing that synchronization works reliably and predictably

Active-active HA was enabled for planned maintenance and unplanned outages. Changes on the target systems occur within 2 minutes of the update on the source system.  “This was not possible before Resilio,” said the IT manager. “It’s a big benefit to be able to update and operate the IT infrastructure without impacting users and applications.”  
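One way to reason about a 2-minute RPO is as a simple check on replication lag: the time between a change landing on the source and the same change appearing on the target. The Python sketch below illustrates that check. The function names and timestamps are hypothetical and are not part of any Resilio or DFSR tooling.

```python
from datetime import datetime, timedelta

RPO = timedelta(minutes=2)  # the recovery point objective cited in this story


def replication_lag(source_modified: datetime, target_synced: datetime) -> timedelta:
    """Time between a change on the source and its arrival on the target."""
    return target_synced - source_modified


def meets_rpo(source_modified: datetime, target_synced: datetime,
              rpo: timedelta = RPO) -> bool:
    """True when the change reached the target within the RPO window."""
    return replication_lag(source_modified, target_synced) <= rpo


# Hypothetical timestamps for one file change replicated two different ways.
changed = datetime(2024, 1, 15, 9, 0, 0)
fast_sync = changed + timedelta(seconds=95)   # arrives in under two minutes
slow_sync = changed + timedelta(hours=6)      # arrives hours late, as with a backlog

print(meets_rpo(changed, fast_sync))
print(meets_rpo(changed, slow_sync))
```

A monitoring job could run a check like this per file or per folder and alert only when the lag exceeds the objective.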

The ability to recover from unplanned events or outages gives the IT team great flexibility in recovering from failures and getting data and virtual machines back online faster. This reduces operational costs and saves time.

“Our allocated time shifted from troubleshooting DFS replication to simply monitoring automated jobs with Resilio Connect. Replication is fully automated and highly available and DR capable. In a true disaster, if one site does go offline and is truly down, we are still up and running because the sites are in sync and data on the live site is still available to users.” 

Resilio saved the IT team over 13 hours per week in management time. What took 14 hours per week with DFSR now takes less than 1 hour using Resilio Connect. “We went from spending at least 2 hours per night dealing with replication and troubleshooting issues with DFSR to spending about 1 hour per week monitoring jobs using Resilio Connect.” 
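The arithmetic behind the labor savings can be reproduced in a few lines of Python. The per-night and per-week figures come from the quote above; treating the work as happening seven nights a week and extrapolating over 52 weeks are assumptions made for this sketch.

```python
HOURS_PER_NIGHT_DFSR = 2      # "at least 2 hours per night" (from the quote)
NIGHTS_PER_WEEK = 7           # assumption: every night of the week
HOURS_PER_WEEK_RESILIO = 1    # "about 1 hour per week" (from the quote)
WEEKS_PER_YEAR = 52           # assumption: no weeks off

dfsr_weekly = HOURS_PER_NIGHT_DFSR * NIGHTS_PER_WEEK
saved_weekly = dfsr_weekly - HOURS_PER_WEEK_RESILIO
saved_yearly = saved_weekly * WEEKS_PER_YEAR

print(dfsr_weekly, saved_weekly, saved_yearly)  # prints: 14 13 676
```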

Another critical benefit was improved, continuous data availability with no disruption to end users and applications. Changes become visible to users without interrupting their work. “Users and applications are able to notice file changes without being impacted.”

From a client perspective, where users interact with the DFS namespace for file sharing and access over SMB, nothing changes. With Resilio Connect, users always have access to files and file shares.

Resilio Connect replicates files across servers and other endpoints in any direction

Another major milestone was overall DFS namespace and file storage system reliability.  “We want to avoid time-outs and disruption. There’s some required time to update applications and make files available to users. It’s critical that users have access to files all the time. In the past, it took too much time to make a file available through DFS.”  

DFS has the concept of an active and passive target server. One of the IT manager’s goals was to hide active target server changes from end users (redirecting users across sites) for a more seamless user experience. For example, with the original DFSR approach, an end user could be logged into the AD and DFS environment, and their files and folders would appear to be available but were actually in a sparse, offline state and not available. This was because DFSR replication was still backlogged and incomplete (i.e., out of sync), which frustrated end users.

With Resilio Connect, the user experience is seamless and transparent.  Redirection from one site to another site works without issue or user intervention. “With Resilio Connect, if there’s a server outage, DFS automatically redirects to the new, available target that’s now in sync.” 

Another big benefit to the IT team is flexibility: the ability to leverage a variety of storage systems and platforms. Resilio supports open file formats and OS platforms, enabling replication across Linux and other server and storage types.

The IT team is hoping to incorporate the Microsoft Azure cloud in the DFS namespace. They like the fact that Resilio Connect supports Azure out-of-the-box and can sync and provide access across multiple Azure regions.


“We don’t have to be locked into a certain storage or cloud vendor because Resilio offers cross-platform and multi-cloud Agents. We can get the data there using Resilio Connect. We have flexibility in the choice of storage and can migrate to different storage or cloud providers if needed.”

Using the Resilio Connect Management Console to centrally manage the deployment, rather than having to remote in from a client to a data center, is a great improvement. “Now we have complete visibility and control into how our files get synchronized and delivered across our organization.”

“At the end of the day it’s really about peace of mind and knowing that replication and sync just works all the time, no matter what. Resilio really put us in a good spot. I can trust the file is getting replicated. Even folders with multiple targets get replicated. Replication now takes minutes, versus hours or days with DFSR.”

Overview

Team gains peace of mind, saves 13 hours per week, and keeps 100 million files continuously synchronized and accessible through DFS
