AI incident response: Why most incidents are process failures

The most damaging AI incidents stem from process breakdowns, not technical failures. Discover how to build incident response procedures that address the real causes of AI failures, not just the technical symptoms, and create recovery processes that rebuild customer trust.

What you will learn

  1. Most AI failures trace back to process issues - not just technical problems, but organizational breakdowns in escalation and edge case handling
  2. Traditional incident response misses AI's unique failure patterns - quality drift, bias amplification, and hallucination cascades need entirely different approaches
  3. Map what actually happens, not what should happen - trace real workflows through Slack threads and workarounds, not official documentation
  4. The 15-minute window is critical - contain or continue decisions must use pre-defined quality thresholds, not gut feelings

McDonald’s spent three years working with IBM to build AI-powered drive-thru ordering. The system was supposed to simplify orders and improve the customer experience. Instead, viral TikTok videos showed customers pleading with the AI to stop adding Chicken McNuggets to their order. One order reached 260 pieces. McDonald’s shut down the entire pilot in June 2024.

What happened in the aftermath? I’d bet the technical teams dove straight into the speech recognition model. Data scientists analyzed training data. Engineers tweaked parameters. The real problem was simpler and worse: they never built proper processes for handling edge cases, customer escalation, or graceful degradation when the AI confused sports terms with food orders.

Same story. Different AI system.

AI project failures frequently stem from organizational and process issues - not just technical ones - including miscommunication, stakeholder misalignment, and insufficient infrastructure. This mirrors the fragmentation problem we see with AI readiness assessments. Traditional incident response misses this entirely.

The process failure pattern

The AI Incident Database reached its 1000th incident milestone in early 2025, with over sixty new incidents added in just two months. GenAI was involved in 70% of incidents, and agentic AI caused the most dangerous failures. What the numbers don’t show: most of these were preventable through better processes, not better algorithms.

Air Canada’s chatbot told a customer he could book a full-price ticket and then apply for a bereavement fare discount within 90 days. That was wrong. The airline’s actual policy didn’t allow retroactive bereavement rates after travel was completed. When the customer asked for the difference back, Air Canada argued their chatbot was a separate entity responsible for its own statements. A tribunal disagreed and ordered the airline to pay damages. The technical failure was straightforward: the chatbot gave inaccurate policy information. The process failure was catastrophic. No oversight existed around what the AI could commit to on the company’s behalf.

The McDonald’s McHire platform breach in 2025 followed the same logic. Security researchers found the AI-powered hiring platform accessible through default credentials “123456/123456” with no multi-factor authentication, exposing data linked to 64 million job application records. The AI worked fine. The process around securing it didn’t exist.

The failure modes repeat across organizations:

  • Poor change management - teams deploy AI updates without proper testing procedures
  • Inadequate oversight - no clear authority structure for AI decision-making
  • Missing escalation paths - no human backup when AI systems hit edge cases
  • Weak monitoring - focus on uptime numbers instead of quality degradation

IBM’s 2025 Cost of a Data Breach Report found that 13% of organizations reported breaches of AI models or applications, with 97% of those lacking proper AI access controls. Shadow AI makes things worse: one in five organizations reported breaches due to unauthorized AI, costing $670,000 more than standard incidents.

Why traditional classification fails AI

Traditional incident response categorizes by technical severity. P1 for service down. P2 for degraded performance. AI incidents don’t fit this model, and forcing them into it creates real blind spots.

From what I’ve watched unfold in consulting work, the real damage usually looks like this:

  • Quality drift - model accuracy slowly degrades from 94% to 87% over six months
  • Bias amplification - AI recruiting tools systematically filter out qualified candidates
  • Context confusion - customer service AI provides confidently wrong answers
  • Hallucination cascade - AI-generated content includes false information that spreads

These problems often start with poor prompt design, something that proper prompt engineering practices can help prevent. None of them register as “outages” in traditional monitoring. But they can damage business reputation and customer trust more than a complete system failure ever would.
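Quality drift is the most insidious of these because each individual response still looks plausible. As a sketch only, a drift check can compare a rolling accuracy window against the accuracy you measured at deployment; the `DriftMonitor` class, the 500-sample window, and the 5-point tolerance below are all illustrative assumptions, not a standard:

```python
from collections import deque

class DriftMonitor:
    """Illustrative quality-drift detector: compare a rolling accuracy
    window against a fixed deployment baseline and flag slow degradation
    long before anything registers as an 'outage'."""

    def __init__(self, baseline: float, window: int = 500, tolerance: float = 0.05):
        self.baseline = baseline             # accuracy at deployment, e.g. 0.94
        self.tolerance = tolerance           # allowed drop before alerting
        self.results = deque(maxlen=window)  # 1 = correct, 0 = incorrect

    def record(self, correct: bool) -> None:
        self.results.append(1 if correct else 0)

    def drifting(self) -> bool:
        # Wait for a full window so one bad hour doesn't trip the alarm.
        if len(self.results) < self.results.maxlen:
            return False
        current = sum(self.results) / len(self.results)
        return (self.baseline - current) > self.tolerance

# Simulate a system whose recent outputs have gone bad.
monitor = DriftMonitor(baseline=0.94, window=500, tolerance=0.05)
for _ in range(500):
    monitor.record(correct=False)
```

The point of the sketch is the shape of the check, not the numbers: the alert fires on a gap from baseline, not on an absolute uptime metric, which is exactly what traditional monitoring misses.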

The OWASP Top 10 for LLM Applications 2025 introduced two new threat categories - System Prompt Leakage and Vector and Embedding Weaknesses - and reworked the 2023 Overreliance entry into a broader Misinformation category. Modern AI incident response systems now classify by business impact rather than technical severity:

  • Type A - immediate safety risk (autonomous systems, medical AI)
  • Type B - financial or legal exposure (decision-making AI, regulatory systems)
  • Type C - brand or reputation risk (customer-facing AI)
  • Type D - work efficiency impact (internal process AI)

A slowly degrading recommendation system needs different handling than a chatbot giving legal advice. That distinction matters far more than whether the system is technically “up” or “down.”
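To make the taxonomy concrete, here is one way the Type A through D triage could be encoded; the field names (`safety_critical`, `makes_binding_decisions`, `customer_facing`) are hypothetical, and a real classifier would draw on your own system inventory:

```python
from enum import Enum

class ImpactType(Enum):
    """Business-impact classes from the taxonomy above (illustrative)."""
    A = "immediate safety risk"
    B = "financial or legal exposure"
    C = "brand or reputation risk"
    D = "work efficiency impact"

def classify(system: dict) -> ImpactType:
    # Walk from worst case down: a system can hit several criteria,
    # and the most severe one should drive the response.
    if system.get("safety_critical"):          # autonomous, medical
        return ImpactType.A
    if system.get("makes_binding_decisions"):  # pricing, policy, hiring
        return ImpactType.B
    if system.get("customer_facing"):          # chatbots, recommendations
        return ImpactType.C
    return ImpactType.D                        # internal tooling

# An Air-Canada-style policy chatbot is customer-facing AND able to
# commit the company to terms, so it triages as Type B, not C.
chatbot = {"customer_facing": True, "makes_binding_decisions": True}
```

Note how the ordering does the work: the Air Canada incident looked like a Type C reputation problem but was really Type B legal exposure, and the triage should surface that.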

Response procedures that actually work

NIST’s latest incident response guide makes it plain: effective response depends more on preparation and process than technical skill. The Cyber AI Profile (NIST IR 8596), published as a preliminary draft in December 2025, specifically addresses AI-related risks aligned with NIST’s Cybersecurity Framework 2.0. For AI systems, preparation is even more urgent.

The 15-minute rule. You have roughly 15 minutes to make the critical decision: contain or continue. Unlike traditional systems where the choice is obvious (broken means shut down), AI systems often limp along providing “mostly correct” output. That “mostly” is where trust quietly erodes.

AI incident response frameworks recommend immediate isolation of affected systems, but this only works with pre-defined triggers rather than judgment calls made under pressure:

  • Quality threshold breach - accuracy drops below acceptable levels
  • Output anomaly detection - unusual patterns in AI responses
  • User feedback spikes - complaints about AI behavior
  • External notification - media coverage or regulatory inquiry
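The triggers above can be wired into a single check so the contain-or-continue call is mechanical rather than a judgment made under pressure. This is a minimal sketch with made-up threshold values; the metric names and numbers are assumptions you would replace with your own:

```python
def should_contain(metrics: dict, thresholds: dict) -> list[str]:
    """Evaluate the pre-defined containment triggers listed above.
    Returns the triggers that fired; any non-empty result means
    'contain' -- no gut feelings involved."""
    fired = []
    if metrics["accuracy"] < thresholds["min_accuracy"]:
        fired.append("quality threshold breach")
    if metrics["anomaly_score"] > thresholds["max_anomaly"]:
        fired.append("output anomaly detected")
    if metrics["complaints_per_hour"] > thresholds["max_complaints"]:
        fired.append("user feedback spike")
    if metrics["external_notification"]:
        fired.append("external notification received")
    return fired

# Illustrative values only: accuracy has slipped below the 90% floor.
thresholds = {"min_accuracy": 0.90, "max_anomaly": 0.8, "max_complaints": 20}
metrics = {"accuracy": 0.87, "anomaly_score": 0.3,
           "complaints_per_hour": 5, "external_notification": False}
```

Returning the list of fired triggers, rather than a bare yes/no, also gives the incident record its first entry: exactly why containment happened.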

Communication matters as much as containment. Open, swift communication during incidents helps organizations recover faster and maintain customer confidence. As we’ve discussed in communicating AI changes effectively, the messaging needs to focus on human impact, not technical features.

Instead of “We’re experiencing technical difficulties,” try:

  • “We’ve temporarily paused our AI recommendations while we investigate quality concerns”
  • “Our customer service team is handling inquiries while we improve our AI responses”
  • “We’re reviewing our AI decision-making process to ensure fair outcomes”

The key difference: acknowledge the AI component explicitly. Customers understand system outages. They don’t understand why AI gave them wrong information, and vague language makes the distrust worse.

AI incidents also require coordination across teams that don’t usually work together. Technical teams debug models. Business teams assess customer impact. Legal and regulatory teams evaluate liability. Communications teams handle public statements. Effective incident response requires clear escalation procedures and decision-making authority distributed across all these functions. The worst AI incidents happen when technical teams make business decisions or business teams make technical ones.

How to investigate AI failures properly

Root cause analysis for AI systems requires different approaches than traditional software debugging. The question isn’t just “what broke” but “why did we design it to break this way?”

Work through five layers:

  1. Immediate cause - what triggered the incident?
  2. Technical cause - why did the AI system behave unexpectedly?
  3. Data cause - what in the training or input data contributed?
  4. Process cause - which procedures failed or were missing?
  5. Organizational cause - what cultural or structural factors enabled this?

Most root causes live at layers 4 and 5: process and organizational issues, not technical problems. Traditional incident response focuses almost exclusively on layers 1 through 3, which is why the same failures keep recurring.

AI incident investigation also requires documenting both technical facts and human decisions: a technical timeline of what happened to the system, a decision timeline of who made which choices and why, data provenance showing what training data or inputs were involved, and a clear record of where existing procedures didn’t cover the situation.
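Those four documentation strands map naturally onto a simple record structure. A sketch, with entirely illustrative field names, might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class AIIncidentRecord:
    """Illustrative record covering the four documentation strands
    described above: what the system did, what people decided,
    what data was involved, and where procedures fell short."""
    incident_id: str
    technical_timeline: list = field(default_factory=list)  # (time, system event)
    decision_timeline: list = field(default_factory=list)   # (time, who, choice, why)
    data_provenance: list = field(default_factory=list)     # datasets / inputs involved
    procedure_gaps: list = field(default_factory=list)      # situations no runbook covered

record = AIIncidentRecord(incident_id="2025-03-chatbot-policy")
record.decision_timeline.append(
    ("10:15", "on-call lead", "contain", "accuracy trigger fired"))
record.procedure_gaps.append("no policy on what the chatbot may commit to")
```

Keeping decisions and procedure gaps as first-class fields, not footnotes in a technical timeline, is what lets the investigation reach layers 4 and 5 instead of stopping at the model.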

Organizations with detailed AI incident documentation recover significantly faster from subsequent similar incidents, and I suspect even that understates the benefit once you factor in the compounding value of preventing recurrence entirely.

Building incident response that holds

63% of breached organizations either don’t have an AI governance policy or are still developing one. Of those with policies, only 34% perform regular audits for unsanctioned AI. Only 35% of organizations have established AI governance frameworks, and just 8% of leaders feel equipped to manage AI-related risks.

Eight percent.

That number genuinely frustrates me, because the gap between where organizations are and where they need to be isn’t a technology problem. It’s a process problem. Solvable, if treated seriously.

Run quarterly tabletop exercises specifically for AI incidents. Use realistic scenarios: gradual quality degradation over weeks, bias discovery in a hiring AI, hallucination in customer communications, a regulatory inquiry about AI decisions. Companies that conduct regular AI incident simulations resolve real incidents significantly faster than those that don’t.

Build the capability stack in this order:

  1. Detection - monitoring that catches quality issues, not just outages
  2. Assessment - rapid business impact evaluation
  3. Communication - templates and approval processes ready before you need them
  4. Technical response - containment and recovery procedures
  5. Investigation - root cause analysis that includes process factors
  6. Learning - post-incident improvement that prevents similar failures

(Consider using Tallyfy’s process documentation to standardize and automate your incident response workflows.)

Getting AI systems back online is only half the challenge. The other half is rebuilding trust. Never restore full AI functionality immediately after an incident. Use staged rollouts: manual mode first with humans handling inquiries, then limited automation for simple cases only, then monitored automation with enhanced oversight, then normal operations. Organizations using phased AI restoration report fewer repeat incidents compared to those that restore full functionality immediately.
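The staged rollout is effectively a small state machine: advance only after sustained healthy quality, and fall all the way back on any regression. As a sketch, with the stage names and the seven-day dwell time as illustrative assumptions:

```python
STAGES = ["manual", "limited_automation", "monitored_automation", "normal"]

def advance(stage: str, quality_ok: bool, dwell_days: int, min_dwell: int = 7) -> str:
    """One step of the phased restoration described above: move forward
    only after a minimum dwell time at healthy quality, and drop straight
    back to manual mode on any regression."""
    if not quality_ok:
        return STAGES[0]                      # regression: humans take over
    i = STAGES.index(stage)
    if dwell_days >= min_dwell and i < len(STAGES) - 1:
        return STAGES[i + 1]
    return stage
```

The asymmetry is the design choice worth copying: promotion is slow and gated, demotion is instant and total. That is what makes the staged restoration credible to customers rather than cosmetic.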

When your website crashes, people understand. When your AI gives wrong answers, they question your judgment. Confidence rebuilding after an AI incident requires visible changes users can actually see, quality numbers shared openly where appropriate, easy access to a human when AI fails, and simple ways to report AI problems.

ISACA’s analysis of 2025 incidents confirms that the biggest AI failures were organizational, not technical: weak controls, unclear ownership, and misplaced trust. Organizations using security AI and automation save an average of $1.9 million per breach compared to those without. A practical incident-response framework for generative AI systems identifies six recurring incident archetypes and formalizes structured playbooks aligned with NIST SP 800-61r3, NIST AI 600-1, MITRE ATLAS, and OWASP LLM Top-10. Organizations can use these structured playbooks as a foundation for building AI-specific incident response capabilities.

Schedule regular reviews: monthly for detection capability assessment, quarterly for response procedure updates, annually for a full incident response system review. After every AI incident, document what you learned about system behavior, which processes need updating, how you’ll detect similar problems earlier, and what authority structures worked or failed. Share those lessons across teams. The AI incident you prevent is worth more than the one you handle perfectly.


Most AI incidents are process failures wearing a technical disguise. The organizations that recognize this, and build their response around human factors rather than just model monitoring, end up with more reliable AI and customers who actually trust them.

Fix the processes first. The technology is the easier part.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience and as the founder of Tallyfy (raised $3.6m), he helps mid-size companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.