Data Readiness: The Prerequisite Most AI Projects Skip

A five-gate blueprint of data readiness checks to run before your next AI project

Meet Priya. She runs a 120-person B2B services firm. Revenue is healthy. Her team has been growing steadily for four years.

Six months ago, she approved a $60K pilot: a predictive lead scoring tool integrated into the CRM her sales team had used since 2021. The vendor was confident. The demo was impressive.

Three months in, the scores felt random. Reps stopped trusting them. Nobody could explain why two of their best-fit accounts ranked low-priority while a dozen cold contacts showed as "hot." The vendor support team reviewed the setup, then sent back a two-page document about data completeness requirements she'd never seen before signing.

The AI wasn't broken. The data was.

Gartner reports that through 2026, organizations will abandon 60% of AI projects due to lack of AI-ready data. Not because of model quality. Not because of team skill. Not because the technology wasn't mature enough. The data wasn't ready.

This is the unglamorous prerequisite that most teams skip because it's boring. And it's decisive.

This article is for Priya, and for every founder, operations lead, or department head who wants to know whether their data is ready before they spend another dollar on AI tools.

What data readiness actually means

"Data readiness" doesn't mean perfect data. It means data that's good enough for the specific AI capability you want to use.

More precisely: data that is findable, accessible, structured, fresh, and permitted for AI use.

  • Findable: you know where the data lives and can reach it without a multi-week project
  • Accessible: the AI tool can read it via API, export, or native connector
  • Structured: it has enough schema and consistency for a model to learn patterns
  • Fresh: it reflects current reality, not what was true two years ago
  • Permitted: legal, security, and compliance have cleared it for AI use

Most teams discover they're weak on one or two of these dimensions. That's usually enough to kill a pilot.

The five failure modes

Knowing what makes data not ready is more actionable than knowing what makes it ready. Here are the five failure modes that kill AI projects before the model gets a chance.

Failure mode 1: siloed data

Your CRM has deal history, but it can't see support tickets. Your marketing platform knows every asset a prospect downloaded, but your sales tools can't see it. Your finance system has three years of payment history, but your customer success platform doesn't know which accounts are 60 days late.

This is the most common failure mode in mid-market companies, and it's invisible until you try to build something that depends on connected data. An Ingest capability can pull from one system. But the moment your AI needs to see the full customer picture (purchase history plus support interaction plus email engagement plus renewal signals), you need those systems talking to each other.

They usually don't. Not without real integration work, and that work has to happen before you buy the AI tool, not after.

Failure mode 2: unstructured fields with no schema

Your CRM has a "Notes" field. So does your support platform, your project management tool, and your tracking spreadsheet. Every rep uses it differently. Some write paragraphs. Some write nothing. Some write "called, left VM" and some write "2/14: spoke w/ J. Chen, interested but needs CFO sign-off, budget ~$40K, Q2 timing."

Free-text fields with no schema are nearly useless for AI that needs to learn patterns. The Analyze capability can extract signal from unstructured text, but only if there's enough of it and it's consistent enough for a model to distinguish signal from noise. Teams often don't discover this problem until after integrating the tool. The model's outputs feel wrong, but the model is doing its best with inconsistent inputs.
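
You can measure this before a vendor does. Below is a minimal sketch in pandas, assuming a CRM export saved as crm_export.csv with a notes column; both names are placeholders for whatever your own schema uses.

```python
# Quick audit of a free-text field's usability.
# "crm_export.csv" and the "notes" column are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("crm_export.csv")
notes = df["notes"].fillna("")

fill_rate = (notes.str.strip() != "").mean()
lengths = notes.str.len()

print(f"Fill rate: {fill_rate:.0%}")
print(f"Median length: {lengths.median():.0f} chars")
print(f"Share under 20 chars: {(lengths < 20).mean():.0%}")
# A low fill rate, or a pile of "called, left VM"-length entries, means
# there isn't enough consistent signal for an Analyze capability to learn from.
```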

Failure mode 3: missing context on records

A record exists in your database, but it's missing the fields that give it meaning.

Your CRM has 8,000 company records, but 40% don't have an industry tag. Your deal history goes back four years, but win/loss reason was only made mandatory 18 months ago.

For a Predict capability building a lead scoring model, those missing fields aren't a minor inconvenience. They're the training signal. If you don't have outcomes attached to inputs, you can't train a meaningful prediction model. Context is the connective tissue. Records without it are data points without meaning.

Failure mode 4: quality problems

Duplicates. Typos. Stale entries. A "company name" field with seven spellings of the same enterprise account. Deal stages that never changed because a rep forgot to update them.

Quality problems confuse models in ways that are hard to diagnose. A Generate capability fed inconsistent reference material produces inconsistent drafts. A lead scoring model trained on duplicate records over-weights certain characteristics because they appear multiple times. An anomaly detection tool learning from stale baseline data flags normal behavior as anomalous. The outputs feel wrong, but the problem isn't the model. It's the inputs.
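
Duplicate detection in particular is cheap to prototype. Here's a minimal sketch using only Python's standard library; the sample names and the 0.8 similarity threshold are illustrative, and real CRM deduplication usually also matches on email domain or address, not just name.

```python
# A minimal name-based duplicate-detection sketch (standard library only).
# Names and threshold are illustrative, not production settings.
from difflib import SequenceMatcher

names = [
    "Acme Corp", "ACME Corporation", "Acme, Inc.",
    "Globex", "Globex LLC", "Initech",
]

def normalize(name: str) -> str:
    # Lowercase, strip punctuation, and drop common legal suffixes.
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c == " ")
    for suffix in ("corporation", "corp", "inc", "llc", "ltd"):
        cleaned = cleaned.replace(suffix, "")
    return " ".join(cleaned.split())

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = SequenceMatcher(None, normalize(names[i]), normalize(names[j])).ratio()
        if score > 0.8:  # illustrative threshold
            print(f"Likely duplicates ({score:.2f}): {names[i]!r} <-> {names[j]!r}")
```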

Failure mode 5: access-restricted data

Your data exists. It's clean enough. It's accessible to humans. But your legal or security team has a policy that prevents feeding it into AI tools.

"No PII into ChatGPT" is a reasonable policy. But if the data your AI needs contains customer names, email addresses, or behavioral data tied to individuals, that policy may block the entire use case. An Execute capability that auto-sends emails needs contact information. A support triage tool needs to read ticket content. A document review tool needs the document.

Before piloting anything, check whether the data you'd feed the tool is cleared. Not just technically accessible, but legally cleared and policy-documented. That conversation needs to happen before the pilot, not after.

The five-question audit

You don't need a data science team to run this audit. You need 30 minutes with someone who knows your systems.

Question 1: Can I download the data my AI would need, today, without pinging IT? If not, you have an access dependency to resolve before any AI tool can do anything useful.

Question 2: Does every record have the fields the AI needs, or are they 40% null? Pull 100 records at random. If more than 20-30% of the key fields are empty or clearly wrong, you have a completeness problem.
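
That spot-check takes about ten lines. A minimal sketch, again assuming a hypothetical crm_export.csv, with placeholder field names you'd swap for the fields your use case actually depends on:

```python
# Completeness spot-check: sample 100 records, measure null rates on key fields.
# File and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("crm_export.csv")
key_fields = ["industry", "deal_stage", "win_loss_reason"]  # placeholders

sample = df.sample(n=min(100, len(df)), random_state=0)
null_rates = sample[key_fields].isna().mean().sort_values(ascending=False)

print(null_rates.map("{:.0%}".format))
# Anything above ~20-30% null on a field the model depends on is a
# completeness problem to fix before piloting.
```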

Question 3: Is the data recent enough to reflect current reality? Lead scoring needs the last 12-18 months of deal data. If your clean data is two years old and your sales process changed 18 months ago, the model learns the old process.

Question 4: Is there one authoritative source, or four conflicting versions? "The CRM is source of truth, but sales keeps a spreadsheet, and finance has different numbers in the ERP" is a coherence problem. AI can't reconcile competing sources. Someone has to decide which system wins.

Question 5: Does legal or security have a policy for feeding this data to AI tools? Ask explicitly. In many mid-market companies, AI data policy hasn't been written yet. Create it before proceeding, not after.

If you can answer all five cleanly, your data is ready enough to start. If two or more give you pause, that's where your pre-AI investment should go.

The data readiness pyramid

Think of data readiness as a pyramid with five levels. Most teams need to climb from the bottom before the higher levels deliver value.

  • Level 1 (Basic hygiene): deduplicated, non-null required fields, consistent schema
  • Level 2 (Integrated): key systems joined or accessible from one place
  • Level 3 (Labeled): training signal exists; outcomes attached to inputs
  • Level 4 (Governed): compliance-cleared for AI use; policy documented
  • Level 5 (Observable): you know when data quality breaks, before the model does

Most mid-market teams starting an AI project are at Level 1 or partway through Level 2. That's fine. You can start AI work at Level 1 or 2. But you have to know which level you're at, because the capabilities you can deploy depend on it.

A team at Level 1 can run Analyze workflows on relatively clean text or structured records, and experiment with Ingest to get documents and audio into usable form. They can't yet run serious Predict workflows, because those require Level 3 (labeled historical data).

A team at Level 3 that hasn't done Level 4 is one vendor audit away from having to shut down its AI workflows. Governance isn't a nice-to-have. It's what lets you scale without rebuilding when policy catches up.

Level 5 is what separates teams that maintain AI value over time from teams whose pilots degrade silently. Observability means having monitoring in place to catch data quality drops: fields going null, duplicate records accumulating, freshness falling behind. Without it, a model that worked six months ago may now produce garbage, and you won't know until a rep calls a dead account.
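
Observability doesn't have to mean a data platform. A scheduled script that recomputes a few metrics and flags threshold breaches is a real start. A minimal sketch, with illustrative column names and thresholds:

```python
# Minimal data-quality monitor: recompute a few metrics on a schedule and
# flag threshold breaches. Table, columns, and thresholds are illustrative;
# wire alerts into whatever your team already watches.
import pandas as pd

THRESHOLDS = {
    "null_rate_industry": 0.30,   # share of records missing industry
    "duplicate_rate": 0.05,       # share of repeated email addresses
    "stale_share_90d": 0.40,      # share of records untouched for 90+ days
}

def quality_metrics(df: pd.DataFrame) -> dict:
    age_days = (pd.Timestamp.now() - pd.to_datetime(df["last_modified"])).dt.days
    return {
        "null_rate_industry": df["industry"].isna().mean(),
        "duplicate_rate": df["email"].dropna().duplicated().mean(),
        "stale_share_90d": (age_days > 90).mean(),
    }

df = pd.read_csv("crm_export.csv")  # hypothetical export
for metric, value in quality_metrics(df).items():
    status = "ALERT" if value > THRESHOLDS[metric] else "ok"
    print(f"{status:5s} {metric}: {value:.0%} (threshold {THRESHOLDS[metric]:.0%})")
```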

Minimum viable readiness per ACE capability

Not every capability needs the same data foundation. Here's the floor for each of the five:

  • Ingest: access to the raw source via API, file export, or native connector. The AI needs to be able to read from wherever the data lives.
  • Analyze: clean enough text or structured data, with sufficient volume (typically hundreds to low thousands of records) for patterns to emerge.
  • Predict: historical labeled data, meaning outcomes attached to inputs. For lead scoring, you need past deals marked won or lost; for churn, past customers marked churned or retained. Without labels, there's nothing to predict toward.
  • Generate: context-rich reference material: product documentation, past examples of what "good" looks like, style guides, company voice. Generate is only as good as the context it's given.
  • Execute: write permissions to the target system, plus an audit trail so you can trace what the AI did and reverse it if needed.

This breakdown is practical for sequencing. If you have clean CRM data but no historical labels, start with Analyze and Generate, not Predict. Build the labeling habit while you run the lower-risk capabilities. By the time you have 12-18 months of labeled outcomes, Predict is within reach.
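
To make the label point concrete, here's roughly what becomes possible once outcomes are attached to inputs. A minimal sketch with scikit-learn, assuming a hypothetical deals_labeled.csv where each deal carries a won label; the features are illustrative, not a recommended feature set.

```python
# What "outcomes attached to inputs" buys you: once deals carry a won/lost
# label, even a basic model can learn from them.
# "deals_labeled.csv" and all column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

deals = pd.read_csv("deals_labeled.csv")
features = ["employee_count", "emails_exchanged", "days_in_pipeline"]
X = deals[features]
y = deals["won"]  # 1 = won, 0 = lost: the training signal from Failure Mode 3

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(f"Holdout accuracy: {model.score(X_test, y_test):.0%}")
# Without the "won" column, none of this runs: there is nothing
# for the model to predict toward.
```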

What to do when your data isn't ready

Most teams are in this position. Here's what actually works.

Start with the one system that is ready. Most companies have one data source that's cleaner than the others. Your support ticket system might be messier than your CRM, but if the CRM has three years of clean deal history with outcomes, start your AI work there. Pick the use case that fits your strongest data, not the use case you most wish you could do.

Run Ingest and Analyze first. These are read-only capabilities that produce insights without changing external state. Running them before Predict or Execute lets you generate value with lower data requirements while you improve quality for the higher-stakes capabilities.

Build labeling habits before you need a model. If you want lead scoring in 12 months, start requiring win/loss reason fields in your CRM today. Mandate them. When you're ready to train, the labels are already there.

Consider vendor AI that brings its own baseline. Products like Salesforce Einstein, HubSpot's predictive scoring, or Gong come with pre-trained models that carry some signal before you add your own data, which reduces the cold-start penalty for smaller teams.

Data readiness as a competitive moat

Here's the part that isn't obvious when you're in the middle of a frustrating pilot.

The teams that do the unglamorous integration work (cleaning their CRM, insisting on mandatory fields, joining their systems, documenting their data policies) are building a moat that model improvements can't erase.

Model quality is a commodity. OpenAI, Anthropic, and Google are racing to give you better models. In 18 months, the models you can access via API will be far more capable than today's. But a better model fed dirty, siloed data will still produce dirty results.

The companies that win the AI race over the next three years aren't necessarily the ones who adopted the latest model fastest. They're the ones who built the data foundation that makes models work. Clean data plus a basic model beats messy data plus the latest model, almost every time.

The boring work that makes AI projects succeed

These are the unglamorous tasks that determine whether your AI pilot actually delivers value:

  • Deduplicate your CRM contacts and accounts before connecting any AI tool
  • Make win/loss reason a mandatory field in your deal records (and backfill 12 months if you can)
  • Audit your most important free-text fields: are reps filling them? Are they consistent?
  • Map your data flows: what goes in and what comes out for every key system
  • Get your legal or security team to write your AI data usage policy before you sign a vendor contract
  • Identify your authoritative source of truth for each key data type: customer records, deal history, support tickets
  • Build a monitoring habit: who reviews data quality monthly, and what do they look for?

None of these are technically complex. All of them require sustained organizational will to actually do. That's the real reason most teams skip this work. It's boring, slow, and doesn't feel like "AI." But it's the most important work you'll do on your AI program.

Everything in the ACE Framework builds up from the data foundation covered here.

Boring beats brilliant. Get the data right, and the AI will surprise you. Skip it, and you'll spend six months wondering why the model is "broken" when the model is working exactly as it should.