Not all IT disaster recovery testing is equal. Every organization sits somewhere on the maturity curve, and the work of improving testing and recovery is never done because new threats emerge all the time. Read about the five stages of IT disaster recovery testing maturity to see where you fall and find out how you can improve.
Stage 1: Unstructured IT disaster recovery testing
Stage one of the disaster recovery maturity curve is self-governed, with individuals finding their own ways to deal with recovery needs. It is random and undocumented and has no dedicated resources or budget. This approach to recovery is often reactive rather than proactive.
In this unstructured testing phase, recovery plans are completely manual and usually paper- or spreadsheet-based, and there may not even be recovery plans for every application or service. In a recent survey of 300 IT disaster recovery decision makers, we found that 40% are not using any automation at all for their recovery activities and 24% don’t have executable recovery plans. If large-scale tests such as data center tests are carried out at all, they are done infrequently. At this stage, tests consume a large amount of organizational effort to plan and execute, sometimes taking as long as 12 weeks.
Maintaining recovery plans at this stage is difficult and time-consuming. There is greater risk and uncertainty because organizations don’t have confidence in their ability to recover in a timely fashion. This also leads to a reduced capacity to perform multiple tests throughout the year.
- Lack of confidence in your ability to recover
- Increased risk of regulatory fines
Stage 2: Regular IT disaster recovery testing
In stage two, disaster recovery processes are usually managed at the departmental level, so although they are more structured than stage one, they are still siloed and have few resources dedicated to them. Recovery Time Objectives (RTOs), if defined, may not be regularly measured and assessed.
Recovery plans may be reviewed, but usually only after a significant change has happened. Large-scale testing is likely performed on a regular basis, and recovery plans exist in an executable form.
This level of testing maturity ensures recovery plans are kept in step and reviewed in light of the most recent changes to applications and services. At this stage, there is increased confidence that the organization can effectively and quickly recover from a total loss scenario. Capturing, executing, and measuring the recovery steps are no longer separate activities but are closely linked.
- Increased confidence
- Reduced risk
Stage 3: IT disaster recovery testing with automation and improved scenario coverage
In stage three, organizations are moderately prepared for disaster recovery but not quite mature. Recovery staff are funded, there are documented recovery plans, and RTOs are defined, but there are still silos. Some business units may have achieved a high state of preparedness but, as a whole, the enterprise is at best moderately prepared and still lacking executive buy-in and funding.
On the positive side, at this stage organizations will start automating manual activities where appropriate and will have recovery plans for all criticalities of applications and services.
The addition of automation can lead to a reduction in Recovery Time Actuals (RTAs) and allows team members to focus on higher-value activities rather than being bogged down in manual processes. This increases confidence that the organization can quickly and effectively recover applications and services.
- Increased confidence
- Reduced recovery time
- Reduced costs thanks to automation
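The link between automated recovery steps and measured recovery time can be illustrated with a short sketch. The Python below is a minimal example, not Cutover's implementation or any specific tool's API: the `Step` and `RecoveryPlan` classes, the `met_rto` helper, the service name, and the placeholder step actions are all hypothetical. It runs each step in order, times it, and compares the total elapsed time (the RTA) with the target (the RTO).

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One automated recovery step (hypothetical structure)."""
    name: str
    action: Callable[[], None]   # the work to perform, e.g. a restore script
    duration_s: float = 0.0      # filled in when the plan runs


@dataclass
class RecoveryPlan:
    """An executable recovery plan with a Recovery Time Objective in seconds."""
    service: str
    rto_s: float
    steps: List[Step] = field(default_factory=list)

    def run(self, clock: Callable[[], float] = time.monotonic) -> float:
        """Execute every step in order and return the Recovery Time Actual.

        The clock is injectable so the timing logic can be checked
        without waiting for real time to pass.
        """
        start = clock()
        for step in self.steps:
            step_start = clock()
            step.action()
            step.duration_s = clock() - step_start
        return clock() - start


def met_rto(rta_s: float, rto_s: float) -> bool:
    """True when the measured recovery time is within the objective."""
    return rta_s <= rto_s


if __name__ == "__main__":
    # Placeholder actions; real steps would call existing recovery tooling.
    plan = RecoveryPlan(
        service="example-service",
        rto_s=4 * 60 * 60,  # assumed four-hour RTO, for illustration only
        steps=[
            Step("restore database", lambda: None),
            Step("verify application", lambda: None),
        ],
    )
    rta = plan.run()
    for step in plan.steps:
        print(f"{step.name}: {step.duration_s:.3f}s")
    print(f"RTA {rta:.3f}s, RTO met: {met_rto(rta, plan.rto_s)}")
```

Recording a duration per step, rather than only the total, shows which steps dominate the RTA and are therefore the best candidates for further automation.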
Stage 4: IT disaster recovery testing with integrations to ITSM and CMDB
At this stage in the maturity journey, senior management is bought in and committed to funding recovery efforts. There is regular testing across all departments, the process is defined and regularly updated, and there are some integrations between technology resilience tooling and the ITSM suite for change, problem, and configuration management. The full potential has not yet been reached, however: only 6% of the respondents in our survey said that 100% of their recovery activities are automated. Customer intelligence data from the CMDB is being appropriately used to augment recovery plans.
- Reduced effort
- Improved experience
- Using golden sources of data
Stage 5: Optimized IT disaster recovery testing as response
At the final stage, IT disaster recovery is state of the art. There are integrations across the tech recovery stack, and the system of execution is an automated recovery plan that everything integrates with. With this setup, progress can be viewed in real time, there is continuous improvement to the process based on audit logs, and senior management participates in recovery activities. Change control methods and continuous process improvement keep the enterprise at a high state of preparedness and able to adapt to changes in the business environment.
With this level of maturity, testing can be performed unannounced and accurately mirrors how the enterprise would respond in a real incident. The organization can intentionally degrade active/active systems and applications during the online day to prove resilience. Major Incident Management resources coordinate and execute test events to gain familiarity with the process and tooling, and the organization begins to stress its systems and people to identify areas of improvement before an incident occurs.
- Reduced risk
- Increased confidence in your resilience posture
Get on the path to IT disaster recovery testing maturity
The right tooling is essential to progressing along the IT disaster recovery maturity journey. As mentioned above, the ability to automate laborious manual processes and integrate with your entire technology stack has a huge impact on the confidence you have in your recovery and the speed and accuracy with which you can test.
The advantages of using Cutover for IT disaster recovery testing:
- Preparedness for unannounced disaster recovery simulations and events
- The ability to rehearse incident simulations for network, system, and application outages and human errors
- The elimination of manual processes through automation and integrations across the tech stack
- Documentation of the recovery process in detailed, dynamic automated runbooks
- Connected teams and technology for efficient collaboration
- Post-event analysis to optimize plans, reduce risk, and meet RTOs