The data quality problem that breaks AI
The data quality problem that breaks AI is not imperfect data. It is how AI learns from your existing data problems and multiplies them until they destroy everything you built. The majority of AI project failures link directly to data quality issues.

Key takeaways
- Most AI failures trace back to data issues - A RAND Corporation study found that the majority of AI project failures link directly to data problems, with poor data quality costing organizations a significant share of revenue each year
- Bad data amplifies exponentially with AI - Small errors in training data lead to large-scale errors in outputs because AI learns and reinforces those flaws at scale
- Real costs run into millions - IBM spent $62 million on Watson for Oncology before it was ultimately shelved, partly because training data contained hypothetical rather than real patient cases
- Data culture matters more than tools - Organizations with strong data quality strategies see 70% increases in AI model performance, yet 63% of breached organizations lack AI governance policies entirely
A Fortune report on MIT’s AI research hit me with a number I’ve been thinking about ever since. At least 30% of generative AI projects get abandoned after proof of concept. Not because the algorithms failed. Because the data was never ready.
The cost of bad data keeps climbing too - poor data quality eats a staggering share of organizational revenue each year through bad decisions, lost customers, and regulatory penalties. That’s not a rounding error. That’s a slow bleed most executives haven’t noticed yet.
What makes this genuinely frustrating is that most people frame the problem wrong. They call it “having bad data” and treat it like a one-time cleanup job. The actual problem is different: AI takes your existing data problems and multiplies them until they wreck everything you built.
Why AI treats bad data differently than other software
Traditional software fails predictably when you feed it bad data. Wrong zip code? Error message. Invalid date? Rejected. Your accountant catches the typo before it reaches the IRS.
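To make that predictability concrete, here's a minimal sketch of up-front validation in Python. The field names and rules are hypothetical, but the behavior is the point: the bad record never gets past the gate.

```python
import re
from datetime import datetime

def validate_record(record: dict) -> list[str]:
    """Return validation errors; an empty list means the record passes."""
    errors = []
    # Hypothetical rule: a US zip code must be exactly five digits.
    if not re.fullmatch(r"\d{5}", record.get("zip_code", "")):
        errors.append(f"invalid zip code: {record.get('zip_code')!r}")
    # Dates must parse as ISO 8601 or the record is rejected outright.
    try:
        datetime.fromisoformat(record.get("invoice_date", ""))
    except ValueError:
        errors.append(f"invalid date: {record.get('invoice_date')!r}")
    return errors

bad_record = {"zip_code": "6311", "invoice_date": "2024-13-01"}
problems = validate_record(bad_record)
if problems:
    # Traditional software stops here. A model trained on this record would not.
    raise ValueError("; ".join(problems))
```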
AI does something worse.
It learns from your mistakes.
Feed a machine learning model data where Black patients historically received less care due to systemic bias, and the algorithm scores them as less sick than equally ill white patients. Millions of patients get underserved. The AI didn’t malfunction. It perfectly learned the wrong lesson from historically flawed records.
Amazon’s recruiting tool discriminated against women because its training data contained mostly male resumes. The model concluded that being male correlated with being a good candidate. Technically accurate pattern recognition. Completely wrong conclusion.
Small biases become systematic discrimination. Minor inconsistencies become confident but wrong predictions. Incomplete records become billion-dollar mistakes. That’s the amplification effect, and it’s what makes data quality a fundamentally different problem for AI than it is for traditional software.
The scale of the failure
I was working through AI readiness data and the numbers kept getting worse. Current projections have organizations abandoning the majority of their AI projects in the near term. Not because the algorithms are bad. Because the data feeding those algorithms isn't ready.
A 2025 AI Governance Survey found that while 30% of organizations have at least one AI model in production, less than 20% have implemented model cards, dedicated incident reporting tools, or regular red teaming exercises. And 63% of breached organizations either don’t have an AI governance policy or are still developing one.
Sixty-three percent.
Google’s diabetic retinopathy detection tool worked brilliantly in controlled experiments. Deployed in real clinics, it rejected more than 20% of images due to poor scan quality. The AI was trained on pristine lab conditions. Real-world data is messy, and nobody tested what happens at that intersection until it actually mattered.
IBM spent $62 million on Watson for Oncology at M.D. Anderson before the project was ultimately shelved. Watson gave erroneous cancer treatment recommendations, including recommending a drug that could worsen bleeding for a patient already suffering severe bleeding. The root cause? Training data contained hypothetical cancer cases instead of real patient data.
Real money. Real patients. Real consequences.
A paper examining 19 popular ML algorithms surfaced something that probably should have been obvious but wasn’t: systematic biases cause larger drops in model quality than random errors. Random noise averages out over enough examples. Systematic bias compounds. Think about what that means for any organization carrying decades of inconsistently collected records.
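A toy simulation makes the gap visible. This illustrates the principle rather than the paper's experiment: assume we estimate a true value of 100 from noisy records. Zero-mean noise washes out as the sample grows, while a constant +5 offset survives any sample size.

```python
import random

random.seed(42)
TRUE_VALUE = 100.0
N = 100_000

# Random errors: each record is off by a different zero-mean amount.
random_noise = [TRUE_VALUE + random.gauss(0, 10) for _ in range(N)]

# Systematic bias: every record also carries the same +5 offset
# (think: a miscalibrated sensor or a skewed data-entry convention).
systematic = [TRUE_VALUE + 5 + random.gauss(0, 10) for _ in range(N)]

print(round(sum(random_noise) / N, 2))  # ~100.0 - the noise averages out
print(round(sum(systematic) / N, 2))    # ~105.0 - the bias never does
```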
Walmart’s early inventory management AI attempts in 2018 show this clearly. Inconsistent product categorization across stores, incomplete historical sales data, varying data entry standards. The AI couldn’t distinguish signal from noise because the noise wasn’t random. It was systematically wrong in different ways at different locations.
The numbers from enterprise surveys are bleak: 96% of organizations engaged in AI projects have faced data quality issues. Eight out of every 10 projects stalled or were abandoned. Only 35% of organizations have an established AI governance framework, and just 8% of leaders feel equipped to manage AI-related risks. The common thread across all of them is treating data quality as something to fix after the fact rather than a foundational requirement going in.
“Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.” — Andrew Ng, CEO and Founder of Landing AI
What actually changes when you fix the data
Organizations implementing solid data quality strategies experience a 70% increase in AI model performance and reliability.
Not 7%. Seventy percent.
I think that number tells you something important about where we collectively are. The data quality problem is so pervasive that addressing it seriously delivers returns most technology investments can’t come close to touching.
The approach that actually works, as that same research reinforces, involves shifting from model-centric to data-centric thinking. Stop asking “which algorithm should we use?” Start asking “is our data actually ready for this?” Documented AI safety incidents rose from 149 in 2023 to 233 in 2024, a 56.4% increase according to Stanford’s AI Index. The stakes keep climbing while most organizations are still debating which model to pick.
Data readiness has concrete requirements. Your data needs consistent formats across sources. Silos where only certain teams can access certain datasets create integration nightmares - 72% of organizations cite data management as one of the top challenges preventing them from scaling AI use cases.
Missing records need flagging, not guessing. When training data has gaps, you need to understand why they exist. Was the information never collected? Was it collected but lost? Does the absence itself signal something meaningful? AI can’t infer context you never provided.
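One pattern that follows from this, sketched with pandas and made-up column names: record the fact of the gap as an explicit feature before imputing anything, so a downstream model can learn whether the absence itself carries signal.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "annual_income": [52_000, np.nan, 71_500, np.nan],
    "tenure_months": [14, 3, np.nan, 22],
})

for col in ["annual_income", "tenure_months"]:
    # Flag the gap explicitly so the model sees that the value was missing.
    df[f"{col}_was_missing"] = df[col].isna()
    # Impute only after the flag exists; the median here is a placeholder
    # choice, not a recommendation.
    df[col] = df[col].fillna(df[col].median())

print(df)
```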
Start here, not with the algorithm
Data quality problems hide in tedious places. Inconsistent date formats between systems. Product codes that changed three years ago while old records still carry the old format. Text fields where different people typed “n/a” or “none” or “not applicable” or just left the field blank. Your AI treats each as distinct information.
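Here's a hedged sketch of the cleanup that implies, assuming pandas 2.0+ and hypothetical column names: collapse the null-like spellings into one canonical missing value and parse the mixed date formats before any of it reaches a model.

```python
import pandas as pd

NULL_LIKE = {"n/a", "na", "none", "not applicable", ""}

df = pd.DataFrame({
    "status_note": ["n/a", "None", "escalated", "  not applicable "],
    "order_date": ["2021-03-14", "03/14/2021", "14 Mar 2021", "2021-03-14"],
})

# Normalize case and whitespace, then map every null-like spelling to a real
# NA so the model sees one kind of missingness, not four distinct strings.
cleaned = df["status_note"].str.strip().str.lower()
df["status_note"] = cleaned.mask(cleaned.isin(NULL_LIKE))

# format="mixed" (pandas >= 2.0) parses each row's format individually;
# errors="coerce" turns anything unparseable into NaT instead of guessing.
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed", errors="coerce")

print(df)
```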
An ArXiv paper on AI integration challenges identifies four critical challenges tied to generative AI and data quality. First, the volume of data required for training large models means even small error rates become massive absolute numbers of wrong examples. Second, data provenance and lineage tracking become nearly impossible at scale without systematic approaches. Third, real-time operations present different quality challenges than batch processing. Fourth, the same problems that break traditional AI amplify differently with generative models.
You can’t fix what you don’t measure. Data audits are probably the right first move, though I’m honestly not sure there’s a single correct entry point. Not the compliance checkbox kind of audit. The kind where you actually sample your data, look at it, and ask honestly: would I make correct decisions based on this?
Build monitoring for metrics that matter: completeness rates, accuracy checks against known ground truth, consistency across related fields, timeliness of updates. Organizations are now implementing continuous data quality checks that detect issues in real time and trigger corrective actions automatically, replacing periodic audits. When these metrics degrade, you need to know before your AI starts producing garbage at scale.
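As one possible starting point, here's a minimal sketch of a batch-level quality check; the thresholds, rules, and column names are assumptions to adapt, not standards.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, thresholds: dict) -> dict:
    """Compute simple data-quality metrics and flag any below threshold."""
    metrics = {
        # Completeness: worst per-column share of non-null cells.
        "completeness": df.notna().mean().min(),
        # Consistency across related fields (hypothetical rule:
        # shipped orders must carry a ship date).
        "consistency": (df["status"].ne("shipped") | df["ship_date"].notna()).mean(),
        # Timeliness: share of records updated within the last day.
        "timeliness": (pd.Timestamp.now() - df["updated_at"]).dt.days.le(1).mean(),
    }
    alerts = {k: v for k, v in metrics.items() if v < thresholds[k]}
    return {"metrics": metrics, "alerts": alerts}

batch = pd.DataFrame({
    "status": ["shipped", "pending", "shipped"],
    "ship_date": [pd.Timestamp("2025-06-01"), pd.NaT, pd.NaT],
    "updated_at": pd.to_datetime(["2025-06-01", "2025-06-02", "2025-06-02"]),
})
report = quality_report(
    batch, {"completeness": 0.95, "consistency": 0.99, "timeliness": 0.8}
)
print(report["alerts"])  # anything here should page someone before retraining
```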
Watch for data poisoning threats too. Feeding AI-generated data back into AI models creates feedback loops that degrade quality over time. Regular audits and anomaly detection aren’t optional anymore.
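A lightweight defense, sketched under the assumption that you keep summary statistics from a trusted baseline: score each incoming batch against that baseline and quarantine anything that shifts too far.

```python
import statistics

def drift_score(baseline: list[float], incoming: list[float]) -> float:
    """Standardized shift in the mean between a trusted baseline and a new batch."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    return abs(statistics.mean(incoming) - base_mean) / base_std

baseline = [0.48, 0.52, 0.50, 0.47, 0.53, 0.49, 0.51, 0.50]
incoming = [0.61, 0.64, 0.60, 0.63, 0.62, 0.65, 0.59, 0.62]  # suspiciously shifted

# Three "baseline standard deviations" is a rule of thumb, not a standard;
# tune the threshold against your own false-alarm tolerance.
if drift_score(baseline, incoming) > 3:
    print("distribution shift detected - quarantine this batch before training")
```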
Treating data quality as a continuous practice rather than a one-time cleanup is what separates organizations that get real value from AI from those that collect expensive lessons. With SOC 2 auditors now scrutinizing AI governance controls and data quality in ML pipelines, this isn’t something you can push to next quarter.
The uncomfortable truth is that the biggest AI failures of 2025 were organizational, not technical. Weak controls, unclear ownership, misplaced trust. Everyone wants to talk about model selection. Almost nobody wants to talk about whether their data deserves a model at all.
About the Author
Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience and as the founder of Tallyfy (raised $3.6m), he helps mid-size companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.
Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.