As a tech CEO working with some of the world’s most sophisticated enterprises, I’m convinced that major incident management (MIM) is one of the next frontiers for Agentic AI, but it’s a hard nut to crack. You wouldn’t assign a brand-new SRE to lead a critical outage on day one and, similarly, you can’t just drop a generic AI agent into a major incident and expect magic. Here are a few key insights on making Agentic AI in MIM a reality:
1. No “one-size-fits-all” data
Every enterprise’s IT estate and incident processes are unique. It’s extremely hard to get good training tokens for an AI in this domain, because past incidents are rare and context-specific. A generic model won’t cut it; it needs enterprise-specific training. Think about it: each company’s incident history and tribal knowledge is its own data moat. Privacy and compliance are paramount because incident data often includes sensitive information, so enterprises prefer private models over sharing raw telemetry. I suspect it won’t be long before we see cloud AI firms training bespoke models for the largest banks and other major organizations, if it’s not happening already. These tailored models will span tasks across front-office and back-office operations, all securely within the enterprise’s environment.
2. Graphs over logs – context is king
Traditional monitoring gives you logs and metrics. But to truly assist in an outage, an AI agent needs the graph of past actions that responders took to fix similar incidents. In other words, the AI needs the sequence of human and automated steps, decisions, and communications during the incident; that timeline/graph is the real runbook.
Simply dumping a ticket and some log lines into ChatGPT misses the point. By capturing the incident workflow (like a directed graph of tasks), we give the agent situational awareness. Visualizing this workflow (think of it as a state machine of the incident) is also key to trust: it makes the agent’s logic transparent and easier to understand.
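To make that concrete, here’s a minimal sketch of an incident workflow captured as a directed graph of tasks rather than a flat log. The task names, fields, and example data are purely illustrative, not from any particular product:

```python
from dataclasses import dataclass, field

@dataclass
class IncidentTask:
    """One node in the incident workflow graph."""
    task_id: str
    description: str
    owner: str                      # human team or automation that performed the step
    status: str = "pending"         # pending | running | done | failed
    next_tasks: list[str] = field(default_factory=list)  # edges to downstream tasks

# A toy timeline of a past incident, captured as a graph instead of raw logs
# (hypothetical example data).
workflow = {
    "detect":   IncidentTask("detect",   "Alert fired on checkout latency", "monitoring", "done", ["triage"]),
    "triage":   IncidentTask("triage",   "Incident commander paged, severity set", "IC", "done", ["diagnose"]),
    "diagnose": IncidentTask("diagnose", "Correlate recent deploys with error spike", "SRE", "done", ["rollback", "comms"]),
    "rollback": IncidentTask("rollback", "Roll back release 2024.11.3", "SRE", "done", []),
    "comms":    IncidentTask("comms",    "Post status-page update", "comms team", "done", []),
}

def walk(task_id: str, depth: int = 0) -> None:
    """Print the workflow in execution order by following the graph's edges."""
    task = workflow[task_id]
    print("  " * depth + f"{task.task_id}: {task.description} [{task.owner}]")
    for nxt in task.next_tasks:
        walk(nxt, depth + 1)

walk("detect")
```

Even a toy structure like this gives an agent (and a human reviewer) the ordering and ownership of steps that raw logs never convey.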
We’ve learned that trying to manage a live incident purely via CLI or free-form AI suggestions is underwhelming; without a higher-level orchestrator to map out the plan, it’s hard for humans to oversee or govern the AI’s actions. A layer-two enterprise orchestrator that maps those steps and integrates human approvals is essential for safety.
3. Enterprise data stays in the enterprise
No major bank or Fortune 500 company wants to hand over its entire incident database to a third-party “black box” for AI or any other reason. That data is part of its competitive advantage.
Instead, the trend is toward enterprise-specific AI co-pilots. For example, AWS’s approach (in partnership with tools like Anthropic’s Claude) is to let companies bring their data to a secure model instance, rather than sending it off-platform. Using standards like OpenTelemetry, all the logs and events from diverse systems can be centralized in the client’s cloud, and their private model can run inference on it. In practice, this means the heavy lifting happens on first-party models that each enterprise owns or fine-tunes. Vendor-provided agents will play a role too, but likely in a limited scope: an AI agent from a software vendor might help automate that vendor’s product, say by bypassing the UI or speeding up workflows, without requiring access to the whole “ocean” of data. As my CTO likes to say, “we want customers to give us a teaspoon of water, not the entire ocean.” In other words, vendors should only require minimal, relevant data to provide value, keeping the customer firmly in control of their crown jewels.
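As a rough illustration of that pattern, the sketch below uses the OpenTelemetry Python SDK to send telemetry to a collector running inside the enterprise’s own cloud rather than to a vendor endpoint. The collector address and service names are placeholders, and the exact setup will vary by environment:

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point the exporter at a collector inside the enterprise's own VPC
# (the endpoint below is a placeholder, not a real address).
provider = TracerProvider(resource=Resource.create({"service.name": "payments-gateway"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector.internal:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("incident-telemetry")

# Emit an incident-related event; it stays within the enterprise's cloud.
with tracer.start_as_current_span("checkout.timeout") as span:
    span.set_attribute("incident.id", "INC-1234")
    span.set_attribute("region", "eu-west-1")
```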
4. Trust but verify (and automate)
In high-stakes incidents, trust is everything. Early on, any AI agent must earn that trust. The best way is by providing complete transparency and maintaining human oversight. That means every suggestion an AI agent makes should come with an explanation or evidence that answers “why is it recommending this?”
We need to log each action it considers, each tool it calls, each result – essentially an audit trail of the agent’s reasoning. Modern AI agent platforms are starting to offer this out of the box. For instance, AWS’s AgentCore can emit an OpenTelemetry trace of each agent session, capturing model calls, tool invocations, and reasoning steps. This kind of observability lets the incident commander replay and scrutinize what the AI did, which is invaluable for both real-time governance and post-incident learning.
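Setting AgentCore’s specific interfaces aside, the underlying pattern is straightforward to sketch with plain OpenTelemetry: wrap each tool call the agent makes in a span so the whole session can be replayed later. The tool registry and attribute names here are hypothetical:

```python
from opentelemetry import trace

tracer = trace.get_tracer("incident-agent")  # assumes a tracer provider is already configured, as above

# Hypothetical tool registry the agent can draw on.
TOOLS = {
    "query_logs": lambda service, window: f"342 5xx errors in {service} over the last {window}",
    "recent_deploys": lambda service: f"{service}: release 2024.11.3 deployed 14 minutes ago",
}

def call_tool(tool_name: str, args: dict) -> str:
    """Invoke one of the agent's tools and record an auditable trace of the call."""
    with tracer.start_as_current_span(f"agent.tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.args", str(args))
        result = TOOLS[tool_name](**args)
        span.set_attribute("tool.result.summary", result[:200])
        return result

# Each call produces a span the incident commander can replay after the fact.
print(call_tool("query_logs", {"service": "checkout", "window": "15m"}))
print(call_tool("recent_deploys", {"service": "checkout"}))
```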
In practice, we’ll initially keep the AI in an “assistive mode”: the agent suggests actions or even executes some low-risk tasks automatically, but anything impactful is routed to a human for approval. Over time, as the system demonstrates reliability and builds up a track record with that company’s data, the AI can be granted more autonomy, but always with guardrails and instant fallbacks in place.
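Here’s a minimal sketch of what that “assistive mode” gate could look like; the risk tiers and actions are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    name: str
    risk: str        # "low" | "medium" | "high"

AUTO_APPROVED_RISKS = {"low"}    # expand this set as the agent earns trust

def execute(action: ProposedAction) -> None:
    print(f"executing: {action.name}")

def request_human_approval(action: ProposedAction) -> bool:
    # In a real system this would page the incident commander; here we just ask on stdin.
    return input(f"approve '{action.name}' (risk={action.risk})? [y/N] ").lower() == "y"

def handle(action: ProposedAction) -> None:
    """Low-risk actions run automatically; anything impactful waits for a human."""
    if action.risk in AUTO_APPROVED_RISKS or request_human_approval(action):
        execute(action)
    else:
        print(f"skipped: {action.name} (not approved)")

handle(ProposedAction("post status-page update", risk="low"))
handle(ProposedAction("roll back release 2024.11.3", risk="high"))
```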
The bottom line
I’m optimistic about Agentic AI in major incident management and envision AI “co-pilots” that help resolve outages faster, more safely, and with less stress. But to get there, we must respect the nuances: tailor AI to each enterprise’s environment, keep the humans in the loop, and make the AI’s actions transparent via tools and visualizations. Our team here at Cutover is working on exactly this, in partnership with AWS and using cutting-edge models like Anthropic’s Claude. The goal is an AI that doesn’t replace the incident manager but empowers them: AI agents handle the grunt work of sifting and diagnosing, surfacing insights, proposing next steps, and even executing fixes under watch. Done right, the result will be fewer outages, faster recovery, and SRE teams that can focus on strategic improvements rather than firefighting. Exciting times ahead for resilience in the AI era!
