Failover Testing Guide: Ensure Network Stability

Should your company face the inevitable and need to recover from an outage or disaster event, preparation is key. Failover testing, which is part of an IT disaster recovery plan, helps you ensure your systems are prepared for a recovery.

This article explains what failover testing is, why it matters, and which IT disaster recovery (IT DR) solutions can help you prove that your company can recover critical IT systems should a disaster event or outage occur.

What is failover testing?

Failover testing is the process of validating that a system can withstand a failure and that sufficient resources are available to recover.

Whether your applications run on-premises (in your own data centers) or in the cloud, the same failover testing principles apply. However, cloud workloads add complexity and nuances that you need to consider.

For example, it’s more difficult to test availability zone (AZ) or region failovers than a data-center-to-data-center failover because of dependencies on the cloud provider, networking, and cloud expertise.

Failover testing vs. disaster recovery

What is the difference between failover and disaster recovery?

Failover is the process of moving servers or applications from one infrastructure to another when a disaster or outage strikes. Disaster recovery is the overall process for recovering IT systems and applications during a disaster event or outage. In short, failover is one of many steps in a disaster recovery process.

Failover testing validates the failover process to help ensure resilience.

Types of failover tests

There are four primary types of failover tests:

Manual failover - A person switches the application to the backup infrastructure and verifies it functions correctly.
Automatic failover - Software scripts switch the application to the backup infrastructure automatically when an outage is detected.
Load balancing failover - Distributes network traffic from the original server across multiple servers to improve performance and reliability.
Network failover - Simulates a network outage and verifies that the backup infrastructure functions correctly.
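As a rough illustration of automatic failover, the sketch below polls a health check and switches to the backup after several consecutive failures. The names (FailoverController, failure_threshold, simulate_outage) and the threshold of three are hypothetical and not tied to any product; a real setup would probe a live endpoint rather than a stubbed function.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverController:
    """Switches from primary to backup after repeated failed health checks."""
    is_healthy: Callable[[str], bool]   # hypothetical probe: endpoint -> healthy?
    primary: str = "primary"
    backup: str = "backup"
    failure_threshold: int = 3          # assumed value; tune per application
    active: str = field(default="", init=False)
    consecutive_failures: int = field(default=0, init=False)
    events: List[str] = field(default_factory=list, init=False)

    def __post_init__(self) -> None:
        self.active = self.primary

    def poll(self) -> str:
        """Run one health check and fail over if the threshold is reached."""
        if self.active != self.primary:
            return self.active  # already failed over; failback is a separate step
        if self.is_healthy(self.primary):
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = self.backup
                self.events.append(f"failover: {self.primary} -> {self.backup}")
        return self.active


def simulate_outage(checks_before_outage: int, total_checks: int) -> FailoverController:
    """Simulate a primary that is healthy for a while, then goes down."""
    state = {"calls": 0}

    def probe(endpoint: str) -> bool:
        state["calls"] += 1
        return state["calls"] <= checks_before_outage

    controller = FailoverController(is_healthy=probe)
    for _ in range(total_checks):
        controller.poll()
    return controller
```

Requiring several consecutive failures before switching is one common way to avoid failing over on a single transient error; a manual failover test would replace the polling loop with an operator decision.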

Planning for failover testing

Regardless of the failover testing scenario, you need a solid testing plan. A best practice is to review guidance for a disaster recovery plan (DRP) and include the relevant components in your failover test plan.

Here are a few essential practices for a successful failover testing plan:

Validate access controls and configurations

Prior to the failover test, confirm that the infrastructure setup, configurations, and access management are accurate for all systems and parties involved.
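One way to make this check repeatable is a small pre-flight comparison of the settings that must match between the primary and recovery environments. The sketch below is only an illustration: the function name and the setting keys (tls_version, admin_group, db_port) are hypothetical, and real environments would load these values from their own configuration sources.

```python
from typing import Dict, List


def preflight_mismatches(primary: Dict[str, str],
                         recovery: Dict[str, str],
                         required_keys: List[str]) -> List[str]:
    """Return human-readable problems found before a failover test starts."""
    problems = []
    for key in required_keys:
        if key not in primary:
            problems.append(f"{key}: missing on primary")
        elif key not in recovery:
            problems.append(f"{key}: missing on recovery")
        elif primary[key] != recovery[key]:
            problems.append(
                f"{key}: primary={primary[key]!r} recovery={recovery[key]!r}")
    return problems


# Made-up example values for illustration only.
primary_cfg = {"tls_version": "1.2", "admin_group": "dr-operators", "db_port": "5432"}
recovery_cfg = {"tls_version": "1.2", "admin_group": "dr-ops"}
issues = preflight_mismatches(
    primary_cfg, recovery_cfg, ["tls_version", "admin_group", "db_port"])
```

An empty list means the listed settings agree; anything else is worth resolving before the test begins.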

Perform tests in a controlled environment

Use an isolated, controlled environment to prevent any data loss or impact to production environments.

Make failover tests as realistic as possible

Test multiple scenarios considering the types and locations of applications. Test both planned and unplanned failovers to ensure your disaster recovery strategies are comprehensive.
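To avoid skipping a combination by accident, the scenarios can be enumerated rather than listed by hand. The sketch below assumes three hypothetical dimensions (failover kind, application tier, location); the values are placeholders for the applications and sites you actually run.

```python
from itertools import product

# Hypothetical dimensions; replace with your own applications and sites.
FAILOVER_KINDS = ["planned", "unplanned"]
APP_TIERS = ["web", "database"]
LOCATIONS = ["on-premises", "cloud-az", "cloud-region"]


def build_scenarios(kinds, tiers, locations):
    """Return every combination as a dict so each case can be scheduled and tracked."""
    return [
        {"kind": kind, "tier": tier, "location": location}
        for kind, tier, location in product(kinds, tiers, locations)
    ]


scenarios = build_scenarios(FAILOVER_KINDS, APP_TIERS, LOCATIONS)
```

With these placeholder values the matrix contains 12 scenarios (2 kinds x 2 tiers x 3 locations).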

Automate failover tasks to reduce manual error

To minimize failover time, automate failover tasks such as application and data replication and the deployment of recovery instances.

Communicate throughout the process

Make sure all team members and stakeholders are aware of the planned (or unplanned) outage and that expectations are set. Maintain communication during the entire failover test with automated notifications when possible.

Review metrics with post-failover reporting

Benchmark systems beforehand. During the failover, track key metrics against your recovery time objectives (RTOs) and recovery point objectives (RPOs). Afterward, review a detailed report of findings and make any required improvements.

Failover test challenges

It’s difficult to perform failover tests that account for every scenario. Outages happen for many reasons, such as natural disasters and power failures, but one of the most common is human error.

According to the Uptime Institute, nearly four-fifths of all outages are caused by human error, so it’s often difficult to replicate the exact causes of outages.

In the cloud, it’s difficult to ensure systems have high performance under varying operational conditions without compromising availability. Cloud service providers (CSPs), like AWS, offer products and services that help simplify and accelerate failover tests and live recovery scenarios.

How to perform a failover test

Ideally, during a failover test you verify and test all functionality of systems. In the simplest terms, a failover test includes the following steps:

Set up recovery plans in a runbook
Launch a recovery instance for source servers
Initiate data replication to the recovery instance
Validate the failover test is complete
Perform a failback replication to the source server or a new server (including copying any new data written on your recovery system), or maintain operations on the new recovery instance
Terminate, delete, or disconnect the recovery instance, if necessary
Analyze failover test results and metrics including RTO and RPO
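The steps above can be sketched as an ordered runbook that runs each step, records how long it took, and stops at the first failure. This is a minimal illustration, not a real orchestration tool: run_runbook is a hypothetical helper, and the step actions are stubs standing in for whatever tooling launches instances and replicates data.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[dict]:
    """Run steps in order, stopping at the first failure, and log each result."""
    log = []
    for name, action in steps:
        started = time.monotonic()
        try:
            ok = bool(action())
        except Exception as exc:  # a failed step should not crash the whole run
            ok = False
            name = f"{name} (error: {exc})"
        log.append({"step": name, "ok": ok,
                    "seconds": time.monotonic() - started})
        if not ok:
            break
    return log


# Stubbed actions standing in for real tooling; each just reports success.
example_steps: List[Step] = [
    ("Launch recovery instance", lambda: True),
    ("Initiate data replication", lambda: True),
    ("Validate failover", lambda: True),
    ("Perform failback replication", lambda: True),
    ("Terminate recovery instance", lambda: True),
]

run_log = run_runbook(example_steps)
```

The per-step timings in the log feed directly into the final analysis step, since they show where recovery time was actually spent.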

You can simplify and accelerate cloud failover tests with add-on services that allow you to automate and scale failover testing and disaster recovery processes. For example, AWS provides services such as:

AWS Elastic Disaster Recovery (AWS DRS), which enables you to automate environment configuration and instance launches to scale failover testing and disaster recovery.
AWS Fault Injection Service (FIS), which allows you to run controlled experiments to improve an application’s performance, observability and resilience and automatically roll back or stop the experiment or failover test if needed.

Failover testing in action

There’s a close correlation between failover testing and recovery. Similar to an IT disaster recovery plan, you need to select key metrics, measure the progress, and then analyze the final results of each failover test.

Two of the most important metrics to track when executing failover testing are RTOs and RPOs. These two key metrics can be defined as:

RTO - the maximum allowable amount of downtime for an application
RPO - the maximum amount of data that an enterprise can afford to lose, measured as the time between the last valid backup and the moment of failure
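As a worked example of these two definitions, the sketch below compares what a test actually achieved against the targets. The function name, timestamps, and targets are made up for illustration; in practice the timestamps would come from your monitoring and backup tooling.

```python
from datetime import datetime, timedelta


def measure_recovery(last_backup: datetime, failure: datetime,
                     restored: datetime, rto: timedelta, rpo: timedelta) -> dict:
    """Compare achieved downtime and data-loss window against RTO and RPO."""
    downtime = restored - failure              # how long the application was down
    data_loss_window = failure - last_backup   # time since the last valid backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Illustrative values only: backup at 09:45, failure at 10:00, restored at 10:50.
result = measure_recovery(
    last_backup=datetime(2024, 1, 1, 9, 45),
    failure=datetime(2024, 1, 1, 10, 0),
    restored=datetime(2024, 1, 1, 10, 50),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=10),
)
```

Here the 50 minutes of downtime meets the one-hour RTO, but the 15-minute gap since the last backup exceeds the 10-minute RPO, which would point to more frequent backups or replication as the improvement to make.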

Failover testing example: Using automated runbooks to ensure success

Challenge:

A major stock exchange needed to meet regulatory requirements with a data center failover test from a primary data center to a secondary one and back again. The test included coordinating thousands of activities and transitioning thousands of services.

Given the complexity of this failover, the exchange did not think spreadsheets would allow it to orchestrate the failover well. Because the exchange is part of critical national infrastructure, it needed to prove, to regulators and others, that it could fail over in the event of a real disaster.

Solution:

The exchange implemented Cutover’s Collaborative Automation SaaS platform to execute the failover test. Using Cutover’s automated runbooks, it successfully orchestrated the failover and tracked the performance of each application with real-time dashboards. Key team members used a dashboard to track the event in the control room while senior stakeholders monitored the event via mobile devices.

Results:

With Cutover’s automated runbooks, the exchange saw results including:

Reduced planning and preparation time for the failover by 80%
Significantly mitigated risk and proved its resiliency to regulators
Created a repeatable process with standardization for full data center failovers every six months

How Cutover can help with failover testing

Cutover helps enterprises streamline and simplify failover tests with automated runbooks. Learn more about how to connect teams and technology and take the risk and cost out of IT disaster recovery and failover testing - schedule a demo today.

Kimberly Sack
