The problem: An industry-stopping cloud failure
A Major Global Bank, a leading American multinational financial services company, faced an unprecedented crisis when a massive 35.5-hour regional failure struck the Amazon Web Services (AWS) US-EAST-1 region. Caused by a cascading DNS failure, the event took key AWS services offline.
For the Bank, the impact was immediate and far-reaching, spanning Digital Banking and eCommerce to Commercial Lending and Credit Services globally. Hundreds of services were potentially impacted.
The challenge was not just technical restoration, but one of scale and communication:
- Scale of coordination: Over 1,800 individuals across technology and business teams were required to coordinate the response.
- Duration and scope: The response team resolved the outage in 35.5 hours (from 3:20 AM on October 20th to 2:50 PM the following day), requiring continuous, high-intensity management.
- Visibility blackout: Without a central hub, tracking hundreds of impacted assets, managing more than 200 recovery tasks, and maintaining DORA-compliant communications would have descended into unmanageable chaos across countless email chains and phone calls.
- Senior stakeholder management: Senior managers and directors needed real-time visibility to inform multi-million dollar decisions.
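As a quick check on the duration quoted above, the 35.5-hour figure follows directly from the stated start and end times. The sketch below is illustrative only: it uses day offsets rather than calendar dates, since the source gives no year, and the variable names are assumptions.

```python
from datetime import timedelta

# Times expressed as offsets from midnight on October 20th (day 0).
# The source does not state a year, so no calendar date is assumed.
outage_start = timedelta(days=0, hours=3, minutes=20)   # 3:20 AM, October 20th
outage_end = timedelta(days=1, hours=14, minutes=50)    # 2:50 PM, the following day

duration_hours = (outage_end - outage_start).total_seconds() / 3600
print(duration_hours)  # 35.5
```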
The solution: Cross-functional visibility and orchestration
To manage this complexity, the Bank used Cutover Respond as the central hub for major incident management, enabling true cross-functional orchestration. Traditional tooling couldn't scale to this level of complexity; the Bank needed Cutover Respond to provide governance, visibility, and a central hub for communications.
The process unfolded as follows:
- Initial alert and handoff: The team detected and logged the incident in ServiceNow, the Bank's central system of record. This action immediately triggered the creation of the dedicated Major Incident Runbook within Cutover Respond.
- Centralized control: An assigned Major Incident Manager (MIM) took the lead, immediately paging out subject matter experts and stakeholders. Cutover Respond became the single platform that centralized all activities and assignments.
- Structured collaboration: The platform provided defined roles and tiered access. Representatives from individual impacted business units could join the runbook, adding tasks and users as needed, while maintaining organizational governance.
- Third-party integration: Crucially, the Bank leveraged the platform to invite AWS engineers directly into the runbook. This allowed for shared situational awareness and highly coordinated problem solving, with the Bank constantly monitoring, reporting on impacts, and pushing AWS for status updates.
- Communications command center: The response team managed 12 concurrent Zoom bridges and used integrated third-party chat to communicate via the incident runbook. This ensured all 1,800+ global participants, including senior leaders, were aligned on service availability, mitigation progress, and necessary user actions.
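The flow described in the steps above (an incident record triggers a runbook, a manager takes ownership, participants join with defined roles, and tasks are tracked in one place) can be pictured with a small data model. This is a hypothetical sketch for illustration only: it is not the Cutover Respond or ServiceNow API, and every class, function, role name, and identifier below is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """One recovery task in the (hypothetical) runbook."""
    title: str
    owner: str
    done: bool = False


@dataclass
class Runbook:
    """Hypothetical stand-in for a major incident runbook."""
    incident_id: str
    manager: str
    participants: Dict[str, str] = field(default_factory=dict)  # name -> role
    tasks: List[Task] = field(default_factory=list)

    def invite(self, name: str, role: str) -> None:
        # Everyone joins with an explicit role, mirroring tiered access.
        self.participants[name] = role

    def add_task(self, title: str, owner: str) -> Task:
        # Governance rule assumed here: only invited participants can own tasks.
        if owner not in self.participants:
            raise PermissionError(f"{owner} has not been invited to the runbook")
        task = Task(title, owner)
        self.tasks.append(task)
        return task

    def progress(self) -> float:
        # Fraction of tasks completed: the single view of recovery status.
        if not self.tasks:
            return 0.0
        return sum(t.done for t in self.tasks) / len(self.tasks)


def open_runbook_for_incident(incident_id: str, manager: str) -> Runbook:
    """Simulates the handoff: a logged incident creates a runbook led by a MIM."""
    runbook = Runbook(incident_id=incident_id, manager=manager)
    runbook.invite(manager, "major_incident_manager")
    return runbook


# Example usage with made-up identifiers.
example = open_runbook_for_incident("INC-EXAMPLE", "mim_on_call")
example.invite("cloud_provider_engineer", "external_partner")
example.add_task("Confirm DNS resolution status", "cloud_provider_engineer")
print(len(example.tasks), example.progress())
```

Keeping roles and task ownership in one structure is what lets a single progress figure be reported to every participant, which is the point the process above makes about centralized control.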
The outcome: Greater visibility and control under pressure
The use of Cutover Respond enabled the Major Global Bank to orchestrate a high-performance response, effectively mitigating potential chaos and demonstrating superior command and control during a cross-industry crisis.
Key results:
- Performance under pressure: The response team seamlessly executed more than 200 coordinated recovery tasks.
- Single pane of glass: The platform provided a unified visual dashboard for all incident activity, giving teams a single view of progress and eliminating time lost switching between multiple systems.
- Streamlined decision making: By consolidating all information, status updates, and communications in one location, the Bank ensured senior leadership had immediate, actionable visibility. This centralized command structure prevented what could have been tens of millions of dollars in extended downtime losses.
- Effective stakeholder management: By establishing a transparent, rapid communication chain, the platform ensured the entire company, as well as external partners, had a real-time, unified understanding of critical service availability.
- Tasks captured for continuous improvement: The 200-plus activities captured in the runbook were 40x the usual number documented with manual methods after an event. This data can then be used as training data for AI, so that the next time there is an incident, the MIM can ask Cutover AI to create a runbook to respond more effectively.
Successfully managing an 1,800+ person, global, multi-line-of-business, cross-industry crisis in a single place allowed the Bank to quickly return to normal operations and demonstrated exceptional control during an otherwise chaotic event. By leveraging Cutover Respond, the Bank transformed what could have been an uncontrolled, prolonged outage into a model of coordinated global response.
