IT disasters are events that disrupt technology services, such as hardware failures, software problems and cyber attacks. Outages related to software, network and systems issues are on the rise and hackers are becoming more sophisticated when it comes to breaking through organizations’ resilience defences. With high costs and damage to reputation and customer relations on the line, organizations must be prepared to address these issues.
An organization can take a number of approaches to address the risk of IT disasters. For example, it could architect its service or application to continue running in the event of a public cloud region failure - but this is expensive and is often only appropriate for the most critical services. Alternatively, it could try to minimize downtime by relying on key people to manage each incident in an ad hoc way as it happens. Both approaches have their uses, but they are not the only choices. Increasingly, even architecting for resilience will not cover every IT disaster scenario, and relying on key individuals is high risk and does not provide a consistent approach.
Is there a better way to plan for IT disaster recovery?
There is an alternative: flexible, visible, executable recovery plans that orchestrate across manual and automated tasks, with redundancy built in for cases where that automation is not available. Below, I detail my thoughts on this third way of managing IT disasters: using automated runbooks to capture a flexible, codified response that can lead to better and quicker recovery.
Many organizations’ IT disaster recovery plans are not really plans
Reading UK Government CTO David Knott’s article about preparedness for IT disaster recovery, and how organizations might respond to the resulting impact on their technology services, prompted some thoughts about what a good disaster recovery plan should look like and what place such plans should take in an organization’s approach to recovery.
Knott talks about two specific problems that organizations have with disaster recovery plans:
“However, despite all of this preparation and expense, many organisations have two problems which mean that their disaster recovery plans may be no use in a real disaster. First, their disaster recovery plans aren’t really plans to recover from disasters. Second, their disaster recovery plans are just plans.”
Our experience working with customers shows that organizations capture and store their recovery activities in a variety of media, many of them inappropriate for disaster recovery. Tools such as Microsoft Excel and Word are often used to capture instructions that the application or service owner hopes will never be used.
But first we should ask: why are recovery plans created that aren’t really plans for recovering from a disaster? Could it be that creating and proving those plans has become just a tick-box exercise? Static formats such as spreadsheets and text documents are not well suited to capturing tasks that are complex and dynamic in nature. Nor are they well suited to integrating with an existing set of tools, which increases the burden on the application or service owner when performing recovery activities.
What makes a real IT disaster recovery plan?
So what is a plan? A plan can be defined as:
- A detailed proposal for doing or achieving something, OR
- An intention or decision about what one is going to do
Both definitions imply or state that a plan should detail what it is that you are going to do. It follows that organizations with disaster recovery plans intend to describe what they would do in the event of a disaster and, specifically, how they would recover technology applications and services. Knott’s comment implies something different, though: that plans are perhaps non-specific ways of responding to a given scenario or situation, and that they might look good on paper but be non-actionable in reality.
If those plans weren’t fit for purpose, or were not actionable, did the creator of those plans consider appropriate scenarios? Were those scenarios “severe but plausible” (in the words of the UK financial services regulator)? Knott makes the following observation:
“…while recovery had been proven for every system in isolation, it had not been proven on a larger scale, such as a complete data centre failure…”
We do agree that it is easier to construct a plan to pass a test - but this raises the question: what are you testing, if not your response to a specific scenario? Failing to test in a way that reflects what might actually happen in a disaster exposes organizations to risk. After all, how often does a significant incident occur at 8am on a Saturday morning?
Proving your recovery plans should therefore also consider how your people respond to an incident, ideally when they are not expecting or prepared for the event. In essence, you are proving the resilience of your organization as well as that of the technology services, infrastructure and applications you depend on.
What should a good IT disaster recovery plan look like?
- It should not be static - it should be executable, measurable and dynamic, and it should be possible to integrate it with other platforms and data sources
- It should show an ideal path to recovery, potentially via automation, but also provide visibility so that people can act and take alternative paths if some piece of technology is not available during a recovery
- It should be a good, cheaper alternative to the more costly exercise of replatforming, reworking or architecting your application or service to cope with a total-loss scenario (such as the loss of a public cloud region or datacenter)
- It should consider known scenarios but also severe but plausible scenarios - consider how multiple events or failures may compound to produce a scenario that you might previously have thought unlikely
- It should be multipurpose, able to cover the recovery of an individual application or service but also act as a component of a larger recovery event for a datacenter, facility, or public cloud region
- It should be easy to report on to demonstrate progress to wider teams without significant manual effort while executing your recovery
- It should give you objective timings of activities so that subsequent review of your recovery activity is not an emotive discussion and you can identify opportunities to further optimize your recovery plan
- Post-event review of your disaster recovery plan should be straightforward, and so should providing evidence of your recovery activities to your compliance teams for internal and external consumption
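To make a few of these qualities concrete, here is a minimal Python sketch of a plan that is executable, falls back to a manual path when automation is unavailable, and records objective timings for later review. It is an illustration only, not a description of how Cutover or any other product works. All names in it (Task, Runbook, promote_replica, repoint_dns, confirm_manual) are hypothetical, and the failing DNS step is simulated.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    name: str
    automated: Optional[Callable[[], None]] = None  # preferred path, if any
    manual_instructions: str = ""                   # fallback path for a person


@dataclass
class TaskResult:
    name: str
    path: str       # "automated" or "manual"
    seconds: float  # elapsed time for this task


@dataclass
class Runbook:
    tasks: List[Task]
    results: List[TaskResult] = field(default_factory=list)

    def run(self, confirm_manual: Callable[[Task], None]) -> List[TaskResult]:
        """Run tasks in order, preferring automation and timing every step."""
        for task in self.tasks:
            start = time.monotonic()
            path = "automated"
            try:
                if task.automated is None:
                    raise RuntimeError("no automation defined")
                task.automated()
            except Exception:
                # Any automation failure hands the task to a person (assumption:
                # a real plan would distinguish retryable from fatal errors).
                path = "manual"
                confirm_manual(task)
            elapsed = time.monotonic() - start
            self.results.append(TaskResult(task.name, path, elapsed))
        return self.results


# Hypothetical recovery steps; the bodies are placeholders.
def promote_replica() -> None:
    pass  # assume the automated call succeeded


def repoint_dns() -> None:
    raise ConnectionError("DNS API unreachable")  # simulated automation outage


runbook = Runbook([
    Task("Promote database replica", promote_replica, "Promote via DB console"),
    Task("Repoint DNS", repoint_dns, "Update the record in the provider console"),
    Task("Notify stakeholders", None, "Send the status update"),
])

manual_log: List[str] = []  # stands in for a person acknowledging each manual task
results = runbook.run(confirm_manual=lambda task: manual_log.append(task.name))

for result in results:
    print(f"{result.name}: {result.path} path, {result.seconds:.3f}s")
```

In this sketch the per-task record of which path was taken and how long it took is what would feed progress reporting and post-event review; a real plan would also need dependencies between tasks, parallel streams and persistent audit storage, none of which are shown here.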
Build a real IT disaster recovery plan with Cutover
IT outages can have a significant impact on revenue and cost you millions for every hour of downtime - plus negatively impact your customers and lead to regulatory scrutiny. Gain confidence in your IT disaster recovery testing and execution and reduce your disaster recovery preparation time by up to 80% with Cutover.
Find out more about using Cutover for IT disaster recovery or request a demo to see the platform in action.