This week we were at the DRJ Spring conference in Orlando, Florida. It was great to be back at an in-person event to meet lots of people, have some great discussions, learn from the experts, and give live demos to the visitors to our booth. Thanks to everyone who stopped by to say hi and attended our talk: “Have we been fooling ourselves about testing?”
You can watch the full recording of the session below, or read our five key takeaways!
We learned a lot from the talks and subsequent discussions over the course of the event. In case you weren’t able to make it or wanted a refresher, we wanted to share the five key takeaways from resilience expert Mark Heywood and CTO of Northpointe Bank, Michael Gibeaut’s discussion about the reality of resilience testing.
1. We may have been fooling ourselves about testing
All too often, testing is used as a box-ticking exercise to appease regulators and reduce anxiety - but how much does it really tell us about what would happen in the real world?
Does this scenario sound familiar?
- Data center tests typically take 12-16 weeks of planning
- The events involve 100+ people all working on the weekend
- On the day itself, you have an 85%+ success rate - a job well done!
- In reality, only about 20% of the time spent is on actual testing, with most of the time being spent on routine administration
Though these kinds of tests are often deemed a success, what would your success rate look like if you didn’t have those weeks or months to plan ahead? Or if you ran a test on any of the other 40 weekends in that year? Or if you had a real major incident or catastrophic loss?
If a test wouldn't hold up in a real-life invocation, is there really any point in running it at all? Following this train of thought, Michael shared how his organization took testing in a radically different direction.
2. Practice how you play
"I practice as if I'm playing in a game. So when the moment comes in the game, it's not new to me.”
- Michael Jordan
Although this quote is obviously about basketball, it is surprisingly apt for resilience testing, and Michael Gibeaut embraced this mantra to push for a new (and initially unpopular) way to ensure resilience, which they called “failover and stay.”
Rather than spending weeks and months testing for a hypothetical outage, they would fail their systems over to the secondary site, keep them there, and then use the next testing weekend to fail them back. The approach was initially controversial because of the perceived risk, but the rest of the organization soon came around: instead of pouring massive resources into arbitrary exercises, the new system let them identify and fix real problems, and that added value was worth it. They were no longer just testing - they were performing live exercises in a real crisis environment, so they would be fully prepared for the real thing.
3. Build partnerships
True resilience involves the whole organization, so building bridges with other departments and forming real partnerships with technology and infrastructure teams is key.
Michael initially faced pushback against the new way of doing things, but once other departments started to see the value of this kind of testing, they became advocates for better resilience, and there was an enterprise-wide shift in thinking about the role of testing.
Rather than taking up people's weekends with arbitrary exercises, they became true partners with the lines of business, tech partners, the infrastructure group, and app support and development teams, as well as compliance partners who worked with the regulators. When even teams that would not have been directly involved in the testing keep resilience top of mind, it's much easier to mobilize in the face of real issues.
4. Don’t be afraid to break things
We've all heard the Facebook motto "move fast and break things," but that attitude is harder to adopt in a highly regulated industry, where customer-facing issues have wide-ranging consequences. Sometimes, though, it's necessary to take a more "chaos monkey" approach and break things a little bit now to avoid a bigger, more catastrophic break later.
A big advantage of an approach like Michael's over standard testing exercises is that it surfaces real problems in your systems that you might not otherwise uncover until it's too late and they've caused real damage. For example, when Michael ran this exercise, the data that came back led his team to change how they designed applications, making their systems more resilient overall.
5. Testing builds confidence if it’s done right
Dealing with threats in this more controlled way also builds muscle memory within the organization, so when a real threat is encountered, you know exactly what to do. And in today’s environment, it’s inevitable that you will have an outage or security threat sooner or later.
So which scenario is scarier: having everyone on standby to respond to a controlled probe of your defenses, or having everyone focus on testing that you know won't hold up in a real crisis?