The problem: Highly manual and uncoordinated failover procedures
A financial services company had to perform data center failovers every six months, involving around 50 applications that supported its loans and fees services. To test this, the company ran simulated failovers roughly once a week, but not always consistently. The simulation process was highly manual and required coordination across more than 14 teams.
The process and the execution tools were separate: information such as the list of steps required for the test was held in Confluence, while the steps themselves were completed using various software tools such as Jenkins and Rundeck. Preparing for the test meant going back and forth between the source of information and the tools to check that everything was planned correctly. The Application Owner then had to set up a bridge call with the Application Team and spend three to four hours running the plan, during which everyone involved had to stay on the call and couldn’t work on anything else.
Post-event reporting was also highly manual and therefore time-consuming, taking a further three to four hours after the test was complete and carrying a high risk of human error.
The solution: Automated runbooks and comprehensive dashboards for data center failovers
Now, Cutover acts as the central planning and execution hub for simulated and real failovers. Cutover hosts the process and connects to the technologies that are used during the event.
After building their failover plans in Cutover, the Application Team can now kick off an automated runbook, and, rather than spending the entire duration of the test on a bridge call, the people involved can work on other things and are notified when they need to log in and complete their tasks.
Thanks to integrations with Jenkins and Rundeck, the information in those systems, such as code and scripts, is available in the Cutover runbook, so it can be accessed without leaving the platform while still being mastered in one place. They have also integrated their Cutover account with Microsoft Teams to further improve communication and visibility across the organization during a failover.
Cutover’s dashboards provided huge value for measuring recovery time actuals (RTAs) against recovery time objectives (RTOs). To do this, the Application Team built a stream containing all the tasks they wanted to measure. The stream summary dashboard provided a forecast duration, which they used to build a plan that fit their RTOs. At the end of the failover simulation, they compared this forecast with the actual duration of the runbook, which is the RTA, to see whether they had met their desired RTO in practice.
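The comparison described above can be illustrated with a minimal sketch. It assumes the measured tasks run one after another, so the forecast is the sum of the per-task estimates and the RTA is the elapsed time from the first task starting to the last task finishing. The `Task` class, the function names, the sample times, and the two-hour RTO are all hypothetical illustrations; none of this is Cutover's API or the company's actual data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Task:
    """Hypothetical record of one measured task in a failover stream."""
    name: str
    forecast: timedelta   # planned duration of the task
    started: datetime     # when the task actually started
    finished: datetime    # when the task actually finished


def forecast_duration(tasks):
    """Planned duration, assuming the tasks run sequentially."""
    return sum((t.forecast for t in tasks), timedelta())


def recovery_time_actual(tasks):
    """RTA: elapsed time from the first start to the last finish."""
    return max(t.finished for t in tasks) - min(t.started for t in tasks)


def meets_rto(duration, rto):
    """True when a forecast or actual duration fits within the RTO."""
    return duration <= rto


# Illustrative values only (assumed, not taken from the case study).
day = datetime(2024, 1, 15)
tasks = [
    Task("Stop services", timedelta(minutes=20),
         day.replace(hour=9, minute=0), day.replace(hour=9, minute=25)),
    Task("Switch database", timedelta(minutes=45),
         day.replace(hour=9, minute=25), day.replace(hour=10, minute=5)),
    Task("Validate applications", timedelta(minutes=30),
         day.replace(hour=10, minute=5), day.replace(hour=10, minute=40)),
]
rto = timedelta(hours=2)  # assumed objective

forecast = forecast_duration(tasks)
rta = recovery_time_actual(tasks)

print("Forecast fits RTO:", meets_rto(forecast, rto))
print("RTA fits RTO:", meets_rto(rta, rto))
print("Overrun against forecast:", rta - forecast)
```

Checking the forecast against the RTO before the event and the RTA against the RTO afterwards mirrors the two steps in the paragraph above: plan to fit the objective, then confirm it was met in practice.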
Post-event, the Application Team has to record in Fusion the RTA and the preparation and validation steps that were carried out. The Application Owner can now download the task list from Cutover and import the files straight into Fusion, reducing post-event reporting from a three-to-four-hour process to five to ten minutes.
“The ability to leave comments in a centralized place and access performance metrics post-event has changed the game for our post-mortem activities and preparations for the next event.” - Application Owner
Centralizing all communications, information, and data in one place has been hugely impactful for the Application Team, and they are now running tests once a week or more.
The outcome: Faster recovery
Many of the failover activities were run by scripts with some manual intervention. Cutover now enables the Application Team to integrate with those scripts and mature them beyond manual intervention, reducing the time taken to complete those tasks.
The Application Team estimated that Cutover is ten times better than their previous manual way of working. The dashboards and post-event review capabilities have saved them around three hours per event of post-execution evidence gathering.
They now have a version-controlled source of truth for the failover runbook and the ability to review and approve the entire process within Cutover.