
Guide to failover


Guide to failover in IT disaster recovery

Failover is a necessary process for achieving high availability and reducing the impact on customers of an outage, whether it is caused by a cyber attack, a natural disaster or anything in between. This guide covers the definition of failover, the processes for failover testing and execution, the different types of failover, and common challenges and solutions.

What is a failover?

A failover is the process used to transfer control from one location or site to another when there is a fault or failure in the first location. Failover can apply to on-premises, cloud and hybrid systems and can be done manually or automatically. Failover forms part of the larger disaster recovery plan for recovering IT systems and applications during a disaster event or outage.

What is failover testing?

Failover testing is the process of validating your system’s ability to fail over successfully and become available. Failover testing involves validating access controls and configurations, performing tests in a controlled environment, and reviewing metrics for continuous improvement.
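As a rough illustration of the "reviewing metrics" part of failover testing, the sketch below summarizes a set of test runs against a recovery time objective (RTO). The FailoverTestResult type, its field names and the summarize_tests function are hypothetical and exist only for this example; they are not part of any particular product.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, List


@dataclass(frozen=True)
class FailoverTestResult:
    """One failover test run (hypothetical structure for illustration)."""
    name: str
    recovery_time: timedelta  # time from starting failover to service available
    succeeded: bool           # whether the service became available at all


def summarize_tests(results: List[FailoverTestResult], rto: timedelta) -> Dict[str, object]:
    """Count how many test runs succeeded and recovered within the RTO."""
    within_rto = [r for r in results if r.succeeded and r.recovery_time <= rto]
    missed = [r.name for r in results if r not in within_rto]
    return {
        "total": len(results),
        "within_rto": len(within_rto),
        "missed": missed,  # names of runs to investigate before the next test
    }
```

Tracking the same summary from one test to the next is one simple way to see whether recovery times are trending toward or away from the objective.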

Three common types of failover

Depending on the needs of your organization, your systems and your customers, different types of failover may be right for you. Here are some examples of how failover can be used:

1. Manual vs automatic failover

Failover can be initiated manually, where a person switches the application to the backup infrastructure and verifies that it functions correctly. Increasingly, organizations use automatic failover, where software scripts switch the application to the backup infrastructure when an outage is detected.
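A minimal sketch of automatic failover might look like the following. It assumes a health probe and a switch action supplied by the caller; FailoverMonitor, check_primary, switch_to_backup and failure_threshold are hypothetical names for this example, and the rule of waiting for several consecutive failed checks is one common design choice rather than a requirement.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Switch to the backup site after repeated failed health checks."""
    check_primary: Callable[[], bool]     # hypothetical probe: True if primary is healthy
    switch_to_backup: Callable[[], None]  # hypothetical action that redirects the application
    failure_threshold: int = 3            # consecutive failures needed to trigger failover
    consecutive_failures: int = 0
    active_site: str = "primary"

    def poll(self) -> str:
        """Run one health check and fail over if the threshold is reached."""
        if self.active_site != "primary":
            return self.active_site  # already failed over; nothing to do
        if self.check_primary():
            self.consecutive_failures = 0  # a healthy check resets the count
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.switch_to_backup()
                self.active_site = "backup"
        return self.active_site
```

Requiring several consecutive failures avoids failing over on a single transient error, at the cost of a slightly slower reaction to a real outage.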

2. Failover and failback

Failover and failback is typically used in testing to examine the functional aspects of failover. In this case, an application or system is failed over and then failed back within the same test.

3. Fail and stay

Fail and stay refers to failing over to the alternate site (during a test or an incident), staying there and running production load for a period of time. Many organizations are moving towards fail and stay and away from failover and failback, because failing over and back within a single test doesn't prove that the alternate site's infrastructure can handle production load.

Common failover challenges


Below are some common challenges faced by organizations in planning, testing and managing failovers:

  • The amount of time it takes to prepare for a test
  • Mismatched environments where the alternate site does not have enough capacity to run production load (at all or for an extended period of time)
  • Tests that don’t match what would be performed in an incident, so they don’t provide readiness or confidence for recovery from an actual disaster event
  • The complexity of simulating the loss of a public cloud region
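The capacity mismatch described in the list above can be checked before a test. The sketch below compares production load with the alternate site's capacity for each resource; the function name, the resource keys and the 20% headroom default are assumptions made for this example, not recommended values.

```python
from typing import Dict


def capacity_shortfalls(
    production_load: Dict[str, float],
    alternate_capacity: Dict[str, float],
    headroom: float = 0.2,  # assumed safety margin above current production load
) -> Dict[str, float]:
    """Return, per resource, how much capacity the alternate site is missing."""
    shortfalls = {}
    for resource, load in production_load.items():
        required = load * (1 + headroom)
        available = alternate_capacity.get(resource, 0.0)  # missing resource counts as zero
        if available < required:
            shortfalls[resource] = required - available
    return shortfalls
```

An empty result suggests the alternate site could carry production load with the chosen headroom; any entry points to a resource to scale up before a fail and stay exercise.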

Steps to performing a failover 

  1. Transfer data to the alternate site at appropriate intervals to ensure that recovery point objectives can be met.
  2. Transfer production workloads to the recovery site. Data will continue to change there as operations continue.
  3. After any failure-related disruption and data losses are resolved (and any known threat is mitigated), the primary production site can resume operations. At this point, the failback operation is executed: production workloads return from the recovery site and the interim data is transferred back to the primary system. However, with fail and stay becoming the norm, this step is often no longer necessary, as there is no longer really a “primary” and “secondary” site and any site can act as the primary one.
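Step 1 above ties the data transfer interval to the recovery point objective (RPO). As a simple worked check, the sketch below computes how much data, measured in time, would be lost if a failure happened at a given moment, and whether that falls within the RPO. The function names and timestamps are hypothetical and for illustration only.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated_at: datetime, failure_at: datetime) -> timedelta:
    """Time span of changes not yet transferred to the alternate site."""
    return max(failure_at - last_replicated_at, timedelta(0))


def meets_rpo(last_replicated_at: datetime, failure_at: datetime, rpo: timedelta) -> bool:
    """True if the unreplicated window is no larger than the RPO."""
    return data_loss_window(last_replicated_at, failure_at) <= rpo
```

For example, if data was last transferred at 12:00 and the failure occurs at 12:10, the loss window is 10 minutes, which meets a 15-minute RPO but not a 5-minute one. In the worst case the loss window equals the full transfer interval, so the interval itself should not exceed the RPO.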

Cutover’s automated runbooks for failover

Cutover offers a comprehensive solution to address your failover challenges and streamline the overall process.

What are the benefits of using Cutover for failover?

  • Codify and automate failover as part of IT DR runbooks
  • Analyze, iterate and audit your IT DR and failover strategy
  • Optimize failover with automation
  • Execute failover tests within recovery time objectives (RTOs) and recovery point objectives (RPOs)
  • Make better decisions during a failover event
  • Save time during execution and post-event reporting

Cutover runbooks make failovers simple

Here’s one example of how Cutover helped one of our customers improve their failover testing and execution:

THE PROBLEM - Highly manual and uncoordinated failover procedures

THE SOLUTION - Automated runbooks and comprehensive dashboards for data center failovers

THE OUTCOME - Faster recovery

Cutover was ten times better than the previous manual way of working. The team saved three hours per event on post-event audit, and they now have a version-controlled source of truth for the failover runbook and the ability to review and approve the entire process within Cutover.

Book a demo to see for yourself how to fail over with Cutover