Major incident management needs clear architecture, not another stack of overlapping platforms. Here are some suggestions on how to use the right tools in the right place.
Every quarter, the pitch decks land on your desk with a new variation of the same promise:
“Our platform now does observability, orchestration, incident response, and post-incident learning — all powered by AI.”
If you’re a senior technology leader responsible for infrastructure resilience, you’ve probably noticed the pattern. The boundaries between tooling categories are blurring. Vendors that were brilliant at one thing are now claiming to be brilliant at five. And the more they blur, the harder it becomes to make a confident architectural decision about something that genuinely matters: how your organization responds when things go wrong at scale.
You’re not alone in feeling stuck. This piece is written to help you step forward.
The problem isn’t the tools, it’s the noise
A senior executive at a major global financial services firm put it to me plainly:
“I have eleven vendors telling me their platform is the answer to major incident management. Three of them didn’t even play in this space eighteen months ago. How am I supposed to build a credible architecture when everyone’s claiming the same territory?”
It’s a fair question, and one we hear across every industry, from banking and insurance to telecommunications and critical infrastructure. The AI wave has given every vendor a reason to expand their narrative. Observability platforms now talk about orchestration. ITSM tools now talk about agentic resolution. Chat platforms now talk about intelligent workflows.
The result, if you’re the one holding the architecture pen, is paralysis. Not because you lack options, but because you have too many and the differentiation between them has become deliberately vague.
Start with a simple question: Who knows what in major incident management?
When you strip away the marketing noise, effective major incident management really comes down to how knowledge is structured and divided - and it’s important to be precise about that.
- Observability platforms know “state”: They hold your logs, traces, and metrics. They detect anomalies faster than any human can parse a dashboard. Platforms like Dynatrace, Datadog, and Splunk are extraordinary at this, and they’ve earned that position through deep, sustained investment in their niche.
- ITSM platforms know “design time data”: ServiceNow, BMC, and their peers are the systems of record that include change requests, configuration items, incident tickets, and audit history. That’s their domain and they’re deeply embedded in it.
- Chat platforms know “communication”: Teams, Slack, and their peers are where humans collaborate in the moment. That’s valuable, but it doesn’t facilitate structured work.
- Escalation platforms know “routing”: Companies such as PagerDuty get the right person on the line. They are critical, but limited.
Here’s the question that often goes unasked when evaluating major incident management solutions: “Who knows work?”
Have you ever asked who knows which recovery task was assigned to whom, in what sequence, with what dependencies, at what time and, most importantly, whether it was actually completed? Who captures the execution, not just the alert or the ticket or the chat thread?
“Work” lives at the orchestration and execution layer. And in most enterprises I have spoken to, it either doesn’t exist as a distinct capability, or it’s being stitched together with scripts, spreadsheets, and heroics at 2am.
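To make “work” concrete, here’s a minimal sketch of what a structured execution record might need to capture. The field names are illustrative assumptions, not any particular product’s schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class RecoveryTask:
    """One unit of incident 'work': who does what, in what order, and whether it finished."""
    task_id: str
    description: str
    assignee: str                                        # a human, an automation, or an AI agent
    depends_on: list[str] = field(default_factory=list)  # task_ids that must complete first
    status: TaskStatus = TaskStatus.PENDING
    started_at: datetime | None = None
    completed_at: datetime | None = None                 # the question no chat thread can answer
```

If no system in your stack holds records like this, the execution layer is the gap.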
Why the “we do everything in major incident management” pitch should make you cautious
When an observability vendor tells you they now handle orchestration, ask yourself: Do I want the tool that detected the problem to also be the tool that decides how to fix it, especially with no independent verification?
This isn’t a theoretical concern. In regulated industries such as financial services, healthcare, and critical national infrastructure, governance requires separation of concerns. The entity that identifies an issue should not be the sole entity that governs its resolution. You need an independent and immutable audit trail. You need to show a regulator that the decision to act, the action itself, and the verification of the outcome were captured in a system that isn’t grading its own homework.
That architectural principle of an independent orchestration layer, agnostic to who or what is executing the work, isn’t just good engineering - it’s what regulators and compliance officers expect.
What good major incident management architecture actually looks like
Rather than one tool stretching to cover everything, the enterprises that are making real progress on major incident management tend to converge on a layered model:
- Detection stays with your observability platform. Let Dynatrace or Datadog do what they do best: surface the signal from the noise.
- Record stays with your ITSM platform. ServiceNow remains the system of record. Nothing changes there.
- Communication stays in your chat platform, whether Teams or Slack. People need to talk during an incident; let them.
- Routing stays with your escalation tool. Solutions such as PagerDuty get the right people engaged.
- Orchestration and execution is the layer that ties it all together. This is where codified runbooks live, with the structured sequence of tasks, dependencies, human decisions, automated actions and, increasingly, AI agent actions that actually drive resolution. It’s where accountability and the evidence trail live (see the sketch below).
When these layers are distinct, each tool does what it was built for. When they blur, you get fragility and you lose the governance that matters most when things go badly wrong.
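As a rough illustration of those boundaries, each layer can be treated as a narrow interface that an independent orchestrator consumes. This is a sketch of the principle under assumed method names, not any vendor’s API:

```python
from typing import Protocol

class Detection(Protocol):
    """Observability layer: surfaces the signal, nothing more."""
    def detect_anomaly(self) -> dict: ...

class Record(Protocol):
    """ITSM layer: system of record for incidents and changes."""
    def open_incident(self, summary: str) -> str: ...

class Communication(Protocol):
    """Chat layer: where humans talk in the moment."""
    def post(self, channel: str, message: str) -> None: ...

class Routing(Protocol):
    """Escalation layer: gets the right people engaged."""
    def page(self, team: str, incident_id: str) -> None: ...

class Orchestrator:
    """Independent layer: sequences the work and owns the evidence trail."""
    def __init__(self, detection: Detection, record: Record,
                 comms: Communication, routing: Routing):
        self.detection, self.record = detection, record
        self.comms, self.routing = comms, routing
        self.audit_log: list[str] = []  # append-only here; immutable in a real system

    def handle_incident(self) -> None:
        signal = self.detection.detect_anomaly()              # detection stays with observability
        incident_id = self.record.open_incident(str(signal))  # record stays with ITSM
        self.routing.page("major-incident", incident_id)      # routing stays with escalation
        self.comms.post("#inc-" + incident_id, "Runbook started")  # talk stays in chat
        self.audit_log.append(f"incident {incident_id}: detected, recorded, routed, announced")
```

The point of the shape: no layer grades its own homework, and the orchestrator is the only place where the full sequence of events is captured.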
Bringing AI agents into major incident management without the handwaving
Every vendor now talks about AI reducing Mean Time to Mitigation (MTTM), with some talking about Mean Time to Resolution (MTTR). Few can explain the mechanism without resorting to vague promises.
Here’s what a credible agentic architecture looks like in practice:
An incident fires. A signal-to-noise trigger invokes a runbook. In parallel, an ITSM agent pulls recent changes on affected configuration items for critical context, given that the vast majority of major incidents in large enterprises are change-driven. Simultaneously, an observability agent analyzes logs and telemetry to surface a potential root cause.
These agents aren’t advisory sidecars. They execute tasks inside the runbook, alongside humans, with full accountability and an immutable audit trail. The Major Incident Manager sees every action - human, automated, or agentic - in one operational space.
The key insight: the agents come from your existing tools. Your observability platform contributes its intelligence. Your ITSM platform contributes its context. But they operate inside an independent orchestration layer that provides the sequencing, governance, and evidence capture so nobody is grading their own homework.
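Here’s a hypothetical sketch of that flow; `itsm_agent` and `observability_agent` stand in for whatever your ITSM and observability platforms actually expose, and the audit trail belongs to the orchestration layer:

```python
import asyncio
from datetime import datetime, timezone

audit_trail: list[dict] = []  # append-only evidence of every action in the runbook

def record_action(actor: str, action: str, result: str) -> None:
    """Every step - human, automated, or agentic - lands in the same trail."""
    audit_trail.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor, "action": action, "result": result,
    })

async def itsm_agent(config_items: list[str]) -> str:
    # Hypothetical: pull recent changes on affected CIs for context.
    changes = f"3 changes in last 24h on {config_items}"
    record_action("itsm_agent", "pull_recent_changes", changes)
    return changes

async def observability_agent(service: str) -> str:
    # Hypothetical: analyze logs and telemetry for a likely root cause.
    hypothesis = f"error spike correlates with latest deploy to {service}"
    record_action("observability_agent", "analyze_telemetry", hypothesis)
    return hypothesis

async def run_runbook(service: str, config_items: list[str]) -> None:
    # The signal-to-noise trigger has fired; both agents run in parallel
    # as tasks *inside* the runbook, not as advisory sidecars.
    changes, hypothesis = await asyncio.gather(
        itsm_agent(config_items), observability_agent(service)
    )
    record_action("runbook", "context_assembled", f"{changes} | {hypothesis}")

asyncio.run(run_runbook("payments-api", ["ci-1042", "ci-1043"]))
```

Both agents write to the same trail as everything else in the runbook; that single stream is what the Major Incident Manager, and later the regulator, actually reviews.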
The path from major incident management Groundhog Day to lights-out automation
Every Major Incident Manager knows the feeling. The 2am call. The same type of failure. The same scramble. It’s “Groundhog Day”.
The path out isn’t a leap to full automation. It’s a disciplined, evidence-based climb that gradually and safely introduces agentic AI.
- Structure your major incident management execution
Start with structured execution that captures every task, decision, and sequence in a way that turns each incident into a learning opportunity, not just a war story. Separate the structured work from the noise of the incident chat channel. The real training data for AI lives in the explicit record of what was done and what worked.
- Incrementally introduce AI agents to execute major incident management patterns
Build a catalog of component patterns. Start with small, low-risk recovery actions where the blast radius is limited. A pattern should run 50 times with human approval at every step to prove it works reliably under those conditions. Only then can the human approval step be removed - not because you trust AI blindly, but because you’ve built a body of evidence the regulator can examine: “This pattern executed 50 times without issue under these conditions. Risk: low. Human step removed.” (A sketch of this promotion rule follows this list.)
Begin with twenty cheap, low-risk actions that fire automatically at the start of every incident, then increase to thirty, then fifty. Progressively, more incidents will be resolved agentically, with the blast radius managed and the evidence trail intact.
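The promotion rule itself is simple enough to sketch. The 50-run threshold is the example above; the class and its fields are illustrative, not a real product API:

```python
from dataclasses import dataclass

PROMOTION_THRESHOLD = 50  # clean, human-approved runs required before the approval gate comes out

@dataclass
class RecoveryPattern:
    """A small, low-risk recovery action that earns autonomy from evidence."""
    name: str
    clean_runs: int = 0
    requires_human_approval: bool = True

    def record_run(self, succeeded: bool) -> None:
        if succeeded:
            self.clean_runs += 1
        else:
            self.clean_runs = 0                  # a failure resets the evidence base...
            self.requires_human_approval = True  # ...and reinstates the human gate
        if self.clean_runs >= PROMOTION_THRESHOLD:
            self.requires_human_approval = False  # promotion earned from evidence, not trust
```

The record of every run is exactly the artifact you hand the regulator when the human step comes out.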
One leading global financial services firm has set a three-year target: 95% of major incidents resolved by lights-out automation and agents. That’s not a slide deck aspiration, it’s a program of work with a clear architectural foundation.
How to choose the right major incident management solutions
If you’re feeling stuck in the noise, here’s a framework for making progress:
1. Insist on separation of concerns: Any vendor that tells you they do everything should trigger a governance question, not excitement. The best architectures have clear boundaries between detection, record, communication, routing, and execution.
2. Ask where the work lives: If you can’t point to a system that captures the structured execution of incident recovery, including all the tasks, sequences, dependencies, decisions, and outcomes, you have a gap. Scripts, spreadsheets, and chat threads don’t cut it - but you probably already know that.
3. Evaluate AI claims against mechanisms, not marketing: When a vendor says “AI reduces MTTM/MTTR,” ask how. Ask where the agents operate, who provides the governance, and what happens when the agent is wrong. If the answers are vague, the capability probably is too.
4. Look for a decade of scars, not a quarter of pivoting: The vendors who will serve you well in major incident management are the ones who’ve been living in this space through the hard years, not the ones who’ve just arrived because AI made it trendy. Depth of experience in your vertical matters. A lot.
This is the first in a series exploring how to build a credible major incident management architecture in the age of AI. Coming up: why the system of execution is the missing layer, how AI agents work credibly inside the runbook, and how to bring your regulator along on the journey from human-in-the-loop to lights-out.
If this resonated with you, we’d welcome the conversation - book a demo to see the platform in action.
