5 resilience orchestration takeaways from DRJ Spring

I’ve just returned from the Spring DRJ Conference in Orlando, Florida, where around 500 attendees listened to various speakers and shared their operational resilience stories and experiences. It was great to be there in person, interacting with attendees and vendors, and I wanted to summarize a number of themes I picked up from those sessions and conversations. These themes dovetail nicely with how our customers are using Cutover, and they also align, from a product perspective, with where we and our customers see the future of operational resilience heading.

The overriding theme was that the resilience of our systems, processes, and people is vitally important!

Disruption is Business As Usual

Disruption to services and normal operations is becoming a Business As Usual (BAU) activity. The world is an unstable place right now, with cyber attacks on the rise. In fact, the FCA recently released a report saying that significant cyber attacks rose by over 50% in 2021 when compared to 2020. Accenture also pointed to an increase in cyber attacks in 2021, whilst events in Ukraine have prompted the UK’s National Cyber Security Centre (NCSC) to warn organizations to take action to improve their resilience. We recently wrote an article on how to prepare for increased cyber threats due to conflict.

Incidents will inevitably happen and how you respond to them is important. Recovery plans that take disruption into account are vital - particularly as firms adapt to the changing regulatory landscape and how those regulations place a burden on organizations to consider the impact on their customers. The aim should also be to learn from incidents and disruptions to build resilience into systems and processes.

Adaptive capacity is defined on Wikipedia as “the capacity of systems, institutions, humans, and other organisms to adjust to potential damage, to take advantage of opportunities, or to respond to consequences.”

This is a great definition of what organizations need to do to become more resilient in their approach to dealing with incidents and disruption. We should be asking ourselves:

  • How can we learn from an incident?
  • How can we plan our responses so we can deal better with the consequences of an incident or disruption to services?
  • Are we sufficiently testing and proving that those responses are adequate in terms of completeness but also in timing, to meet stated recovery time objectives (RTOs)?
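As a rough illustration of the last question, the check against RTOs can be sketched in a few lines of Python. Everything here is an assumption made for the example: the `RehearsalResult` record, the service names, and the figures are hypothetical and are not taken from Cutover or from any conference session.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RehearsalResult:
    """One recovery rehearsal for a service (hypothetical record layout)."""
    service: str
    rto: timedelta                # stated recovery time objective
    measured_recovery: timedelta  # how long the rehearsal actually took
    tasks_completed: int
    tasks_planned: int


def assess(result):
    """Return (complete, within_rto) for a single rehearsal."""
    complete = result.tasks_completed == result.tasks_planned
    within_rto = result.measured_recovery <= result.rto
    return complete, within_rto


# Illustrative figures only.
results = [
    RehearsalResult("payments", timedelta(hours=2),
                    timedelta(hours=1, minutes=40), 25, 25),
    RehearsalResult("online-banking", timedelta(hours=4),
                    timedelta(hours=5), 30, 28),
]

for r in results:
    complete, within_rto = assess(r)
    print(f"{r.service}: complete={complete}, within RTO={within_rto}")
```

Keeping completeness and timing as separate flags reflects the point above: a response can finish every task and still miss its RTO, or the reverse.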

It’s time to move past spreadsheets

It was clear from our conversations at DRJ that there are a large number of organizations that:

  • Have not created runbooks or playbooks that detail their response plans to specific scenarios
  • Have not created and detailed their runbooks to an appropriate level (we see customers typically define their Cutover runbooks at a service, application, or utility level)
  • Have plans sitting in spreadsheets (or other tools not fit for purpose)
  • Have not digitized their runbooks (if they exist at all!) 

Long gone are the days when paper-based recovery plans were sufficient. Organizations today need their plans to be stored electronically, in the cloud, for use at a moment’s notice and accessible across the organization, no matter where teams are located.

Move your Operational Resilience program toward scenario-led planning and testing

With the changing regulatory landscape in front of us, organizations need to consider how they respond to various scenarios. Test planning and testing should therefore cover a multitude of scenarios, including but not limited to:

  • Specific key or important service impact
  • Total loss of a datacenter
  • Partial loss of a datacenter
  • Loss of an office building
  • Specific infrastructure platform loss/impact

You should make your testing multidimensional, so you are continually preparing for a variety of scenarios that might occur. Consider the edge cases and prepare for failure - because no matter how good your security is, it’s inevitable that something will happen eventually.

Practice how you play to develop muscle memory

Not only should we be digitizing our runbooks and creating runbooks that take different scenarios into account, but we should also be testing and proving those runbooks on a regular basis. I recently wrote an article about NFL teams’ approach to trick plays, and what we might learn from it when shaping an approach to Operational Resilience.

At the DRJ Conference, resilience expert Mark Heywood and Northpointe Bank CTO Mike Gibeaut expanded on this theme in their session. They talked about “practicing how you play” to build muscle memory, so that when incidents occur (which they will!), the organizational response is rapid, effective, and appropriate.

Effective orchestration is key

It was interesting to hear and see the term orchestration being used more readily at the conference. Effective orchestration is a key tenet for Cutover - we believe that the crossover between human-owned activities and those tasks that can (and should) be automated should be properly managed in a single place. Human-machine orchestration allows your people to focus on high-value activities whilst automating the low-value tasks.

We’re looking forward to the next DRJ Conference in Phoenix, where no doubt there will be more to learn and impart. Find out more about how Cutover can help with your Resilience program, or contact us if you have any questions.