Your AI pilot worked. Users like it. The early numbers are encouraging. The sales development rep team is saving three hours a week on email personalization. Reply rates are up 12%. The pilot owner presents the results. Leadership nods.

Then someone asks: "How do we scale this?"

And suddenly nobody knows the answer.

This is pilot purgatory. It's the space between "the pilot worked" and "we have AI running in production." It's where most Stage 2 organizations get stuck, sometimes for a year, sometimes longer. Forrester identifies use case sprawl as one of the biggest barriers to scaling AI: enterprises often have dozens of pilots but lack a framework to prioritize and activate them. The pilot sits in limbo while the team debates infrastructure, waits for budget, argues about ownership, or simply runs another pilot because that feels safer than the commitment production requires.

Understanding why pilot purgatory happens is the first step to escaping it. If you're still in Stage 1, read Stage 1 to 2: From Ad-Hoc to Pilot first.

What pilot purgatory looks like

Key Facts: Pilot-to-Scale Challenges

Forrester identifies use case sprawl as one of the biggest barriers to scaling AI; enterprises often have dozens of pilots but lack a framework to prioritize and activate them into production (Forrester, 2025)

Organizations that have scaled AI (Stage 3+) generate 3-year total shareholder returns roughly 4x higher than AI laggards; the gap between leaders and laggards widens with each year of delay at Stage 2 (BCG, 2025)

McKinsey found that moving from successful AI pilots to enterprise-wide deployment requires adopting a scaling mindset, taking a modular approach to building AI assets, and shortening the analytics-development lifecycle, requirements that most pilot-stage teams are not structured to execute (McKinsey, 2025)

Pilot purgatory has consistent symptoms.

AI pilot purgatory shown as a ready launch capsule held down by unresolved tool, infrastructure, ownership, and timeline moorings

Pilots that stay in pilot for 12+ months. The time boundary set in the charter comes and goes. Nobody explicitly decides to scale. Nobody explicitly kills the pilot. It just keeps running at low volume, consuming some effort without delivering production-scale value.

Multiple competing tools being evaluated simultaneously. Instead of committing to one tool and scaling it, the team opens evaluations of two or three alternatives. "We want to make sure we pick the right one." Eighteen months later they still haven't picked one.

Executives waiting for "one more data point." The pilot produced good results, but leadership wants a larger sample size, a longer time window, or a more complex use case before committing. The bar keeps moving.

No infrastructure decisions made. The pilot ran on a vendor-managed cloud setup with no architectural commitments from the company. Scaling would require decisions about data storage, model hosting, API ownership, and monitoring that nobody has made yet.

The pilot team is burnt out. The same three people who ran the pilot are being asked to run the next two pilots while also figuring out how to scale the first one. They're stretched, unstructured, and starting to disengage. McKinsey's analysis found that making the transition from successful AI pilots to effective enterprise-wide AI requires adopting a scaling mindset, taking a modular approach to building AI assets, and shortening the analytics-development life cycle, none of which a burnt-out three-person team can do.

If two or more of these are true, you're in pilot purgatory. The exit requires execution discipline, not more evaluation.

"The Stage 2-to-3 transition fails most often not because the pilot was wrong but because the team treats production deployment as a larger version of the pilot. It isn't. Production requires infrastructure commitments, change management beyond volunteers, cost governance, and a named AI Operations owner. Companies that run Stage 3 with Stage 2 infrastructure face incidents, cost surprises, and adoption collapse." (Rework)

The Stage 2-to-3 Crossing Test

A four-requirement diagnostic confirming an organization has genuinely crossed from Stage 2 to Stage 3, rather than relabeled a scaled pilot as production. Requirement 1: One use case is deployed to all intended users (not a volunteer cohort), with monitoring active. Requirement 2: Shared infrastructure decisions are documented and committed (model provider, vector database, observability stack), not ad-hoc per use case. Requirement 3: A second use case is in active pilot using the Requirement 2 infrastructure as its foundation. Requirement 4: A named AI Operations owner exists with explicit accountability for production system uptime, incidents, and model versioning. Organizations that meet Requirement 1 only are running a scaled pilot, not a Stage 3 deployment.

Stage 3 exit criteria

Stage 3 isn't just "more pilots." It's a qualitatively different operating posture. Here's what must be true to claim it.

Requirement	What it means	Common gap
One use case in production	Deployed to all intended users, not a volunteer cohort. Full rollout with monitoring in place.	Still running at pilot scale for the "right" users only
Shared infrastructure decisions made	Vector database, model provider, guardrail tooling, and observability stack: selected and committed	Still using pilot's ad-hoc infrastructure. Each use case running on separate vendor stacks.
A second use case in active pilot	Proof that the learning from Pilot 1 transferred to Pilot 2. Not just planned: running.	All energy went to scaling Pilot 1. No capacity for new use cases.
AI Operations ownership	Named individual or small team responsible for production AI systems: uptime, incident response, model versioning	Pilot owner still running production alongside their day job

All four requirements must be true before calling the organization Stage 3. The most commonly skipped one is shared infrastructure. It's also the one that causes the most expensive problems if skipped.

The infrastructure decisions Stage 3 requires

Scaling pilots into production requires architecture choices that are easy to postpone and expensive to retrofit later.

Stage 3 AI infrastructure decisions shown as five production modules sharing one foundation

This is the part that surprises most transformation leads who come from a software-as-a-service buying background. AI at production scale requires architectural commitments that don't apply to ordinary SaaS tools. You need to make these decisions before scale, not after.

Vector database choice. If your AI applications involve retrieval-augmented generation (RAG), searching internal documents, or memory across interactions, you need a vector database. The major options in 2026 are Pinecone, Weaviate, Qdrant, and pgvector (for teams already on PostgreSQL). The choice matters less than making it. Running different use cases on different vector stores creates fragmentation that's expensive to unwind.

Model provider decisions. Which large language model (LLM) providers are you standardizing on? OpenAI, Anthropic, Google, or a mix? Enterprise agreements have different pricing, data handling terms, and performance characteristics. Using ChatGPT free tier for pilots is fine. Production workloads need proper enterprise agreements with data processing agreements (DPAs) and SLA commitments.

Guardrail tooling. How do you prevent the AI from producing harmful, incorrect, or off-brand outputs at scale? At pilot stage, a human reviews every output. At production scale, that's not possible. Guardrails run as pre- and post-processing layers around model calls. Options range from custom prompt engineering to purpose-built tools like Guardrails AI or LlamaIndex's safety layers.

Observability stack. How do you know when your AI is broken? At pilot scale, the pilot owner notices. At production scale, you need logging, alerting, and dashboards. Which metrics matter: response latency, error rate, fallback rate, human review triggers. This is the one infrastructure component most teams discover they need only after a production incident.

API ownership and rate limits. Who owns the API keys for your model providers? At pilot stage: probably the person who set up the pilot. At production scale: a service account owned by IT, with rate limits set to prevent cost spikes and with rotation policies for security.

These aren't IT decisions in isolation. They're architectural commitments that affect every AI use case you'll build afterward. Making them jointly, before scaling, saves months of rework.

The production deployment checklist

The checklist defines what production means for AI systems before users, customers, or executives start depending on the output.

AI production deployment checklist shown as a launch console protected by eight readiness safeguards

"We're going to put this in production" means different things to different teams. For AI systems, production has a specific meaning. These eight things need to be true.

1. Monitoring is active. Not "we'll check it occasionally." Active monitoring with alerts for error rate spikes, latency degradation, and unusual output patterns.

2. Fallback logic exists. When the AI fails (model timeout, guardrail trigger, rate limit), what happens? The user must not see a blank screen or a raw error message. Define the fallback: show cached output, queue for human review, gracefully decline.

3. Human-in-the-loop gates are defined. Which outputs require human review before action? For AI-drafted external communications, define whether the human reviews before sending (required) or after sending (too late). Set this rule before go-live.

4. Model versioning is tracked. When the model provider updates their model, your outputs may change. You need a record of which model version was running when, so you can diagnose unexplained output changes.

5. Incident response exists. If the AI sends wrong information to a customer, who do you call? In what order? What's the SLA for resolution? Write the runbook before you need it.

6. User communication is done. All intended users know the AI is running in production, understand what it does, and have been trained on how to review, correct, and escalate outputs.

7. Cost monitoring is active. Production AI workloads can generate unexpected cost spikes. Token consumption, API call volume, and model selection all affect cost. Someone owns the monthly cost report.

8. Data processing agreements are signed. If the AI processes personal data or company-confidential data, your vendor's DPA must be in place before production. "We'll add the DPA after we scale" is a compliance violation, not a backlog item.

This checklist is the difference between "deployed" and "production." Most teams hit 5 of 8. The three they skip are the ones that cause incidents.

Scaling change management: from volunteers to everyone

Pilots run on volunteers. The people in your pilot cohort raised their hands. They were curious about AI, willing to experiment, and tolerant of rough edges. They're not a representative sample of your full user population.

AI change management at scale shown as a volunteer lane widening into a supported workforce adoption concourse

Production scale involves everyone. And everyone includes the skeptics, the time-constrained, the "this seems like more work not less," and the ones who are quietly worried that if the AI does their job well, they won't have a job.

This is where Stage 2-to-3 transitions most commonly stall. The pilot worked on 20 sales development reps who volunteered. The full team has 80 SDRs, including 30 who have been doing this job for five years and are not enthusiastic about changing their workflow.

The change management playbook for Stage 3 has four parts.

Train for the new workflow, not the tool. Employees resist tools. They accept workflow improvements. Don't run "AI tool training." Run "new SDR workflow walkthrough" that happens to include the AI tool. Show the before and after. Make the time saving visible.

Give the skeptics a safe way in. Don't mandate the AI as the only path. Let skeptics use it optionally at first. The social proof of colleagues getting better results is more persuasive than an executive mandate.

Address the fear directly. If you don't name the "will this replace me?" concern explicitly, it will dominate every informal conversation about the rollout. Name it in the all-hands. Describe the role evolution: what the AI handles, what humans do that AI cannot. Be specific. Vagueness makes fear worse.

Measure adoption, not just outcomes. How many users are actually using the AI workflow daily? Weekly? Not at all? Track adoption as seriously as you track business outcomes. Low adoption explains underperforming outcomes, and it's fixable if you catch it early.

Governance at Stage 3: from draft to function

At Stage 2, you had a minimum viable policy. At Stage 3, you have AI running in production across multiple use cases. The governance requirements expand accordingly.

Stage 3 AI governance function shown as an approval, incident, policy review, and audit workbench

Tool approval function. Someone reviews new AI tool requests, applies the data classification framework, and issues approvals. Not a committee. A person with a process and a turnaround time.

Incident ownership. When a production AI incident occurs (wrong output, data exposure, model failure), who owns resolution? At Stage 3, this is a named role: AI Operations lead or equivalent. Not a shared email list.

Quarterly policy review. AI moves fast. Your approved tool list from six months ago may include tools that have changed their data handling terms. Build a calendar event: quarterly review, two hours, approved tools list plus incident log.

Audit trail baseline. For every Execute-capable AI workflow (AI taking actions in systems), you need a log of what the AI did, when, and with what result. This is a legal and compliance requirement at production scale in most jurisdictions, and it's the foundation for troubleshooting when something goes wrong. See Audit Trails for AI Execute Actions for what a production-grade audit trail requires.

Rework Analysis: Based on enterprise AI deployment patterns, the three infrastructure decisions that most often get deferred until after a production incident are: observability stack (teams discover they need it when they can't diagnose why outputs degraded), data processing agreements with model providers (addressed post-deployment when legal flags the compliance gap), and cost monitoring (surfaced when the first surprise invoice arrives). All three are inexpensive to implement before production. All three are expensive to retrofit after an incident. The production deployment checklist in this article is specifically sequenced to ensure these three are addressed before go-live.

The ROI recalculation at Stage 3

Pilots prove feasibility. Production proves economics. These are different questions.

A pilot with 20 users over 60 days that saves 3 hours per user per week looks like this in annualized terms: 20 users x 3 hours x 48 weeks = 2,880 hours saved. At an average fully-loaded cost of $50/hour, that's $144,000 in annualized labor value.

But what does it actually cost to run that AI in production? Model API costs, AI Operations team time, infrastructure costs, monitoring overhead. If the total annual cost of ownership is $40,000, the return on investment is $104,000 net. That's a clear business case.

Run this calculation before your Stage 3 scaling commitment. And use real numbers, not pilot-stage vendor pricing. Ask your model provider for production pricing at your expected token volume. Ask your AI Operations hire what their time costs. Add 20% for overhead you haven't anticipated.

Why AI ROI Is Hard to Prove gives you the full ROI measurement framework with the dimensions that matter: time saved, cost reduction, quality improvement, and revenue impact.

A real Stage 2-to-3 example

A 200-person SaaS company ran a successful pilot with an AI Support Agent in Q3 2025. The pilot showed that AI could resolve 35% of inbound support tickets without human involvement, freeing up 15 hours per week per support engineer.

Before scaling, they made three infrastructure decisions: standardized on OpenAI Enterprise for their model provider, committed to Pinecone as their vector database for the support knowledge base, and deployed LangSmith for observability.

They hired an AI Operations Lead in Q4 2025. That person owned the production deployment checklist, ran user training, wrote the incident runbook, and set up the cost monitoring dashboard.

By Q1 2026, the AI Support Agent was in production for the full support team. In parallel, they launched Pilot 2: an AI Sales Operations assistant that drafts call summaries and next-step recommendations directly in the CRM. That pilot is running now, with Pilot 1's infrastructure as the foundation.

They didn't run five pilots in parallel. They ran one well, committed the infrastructure, and used that foundation for the next one. That's the Stage 2-to-3 transition done right.

What comes next

Production deployment is Stage 3. The move to Stage 4 is a different order of magnitude: AI not as a layer on top of your workflows but woven into your core operating model. Most mid-market companies won't be ready for Stage 4 for another two to three years.

But understanding what Stage 4 requires helps you invest correctly at Stage 3. You're not just scaling tools. You're building the infrastructure and governance that Stage 4 depends on. That investment compounds forward.

Read: Stage 3 to 4: From Scaled to Integrated for the organizational and architectural requirements.

Read: The 5 Stages of AI Maturity to see the full maturity model and where Stage 3 sits.

Read: AI Incident Response Playbook before your first production deployment. You'll want it ready before you need it.

Frequently Asked Questions

What is pilot purgatory in AI transformation?

Pilot purgatory is the Stage 2 condition where AI pilots run indefinitely without a production deployment decision. The pilot produced encouraging results but leadership keeps requesting more data, more evaluation time, or parallel competing evaluations. Meanwhile, the pilot team burns out and other priorities absorb leadership attention. Forrester identifies use case sprawl as one of the biggest AI scaling barriers: organizations have dozens of pilots but lack the framework to activate them.

What are the Stage 2-to-3 exit criteria?

Four requirements must all be true: (1) one use case is deployed to all intended users with active monitoring, not just a volunteer cohort; (2) shared infrastructure decisions are documented and committed (model provider, vector database, observability stack); (3) a second use case is actively piloting using the shared infrastructure as its foundation; (4) a named AI Operations owner exists with explicit accountability for production uptime, incidents, and model versioning.

What is the Stage 2-to-3 Crossing Test?

The Stage 2-to-3 Crossing Test is a four-requirement diagnostic confirming genuine Stage 3 attainment. An organization that only meets Requirement 1 (full deployment to all users) but not the infrastructure, second pilot, or AI Operations requirements is running a scaled pilot, not a Stage 3 deployment. The distinction matters because scaled pilots still carry Stage 2 fragility: individual owner dependency, ad-hoc infrastructure, and no capacity to add a second use case without starting from scratch.

What infrastructure decisions does Stage 3 require?

Stage 3 requires five infrastructure commitments: vector database selection (if using RAG or document search), model provider enterprise agreements with data processing agreements signed, guardrail tooling to prevent harmful or off-brand outputs at scale, an observability stack for monitoring error rates and latency, and API ownership and rate limits transferred from individual accounts to IT-managed service accounts. These decisions affect every subsequent use case and are significantly more expensive to change after three use cases are built on ad-hoc infrastructure.

How do you manage change from pilot volunteers to full team rollout?

Scaling from pilots (who are volunteers) to full teams requires four actions: train for the new workflow, not the tool (employees resist tools but accept workflow improvements); give skeptics an optional entry point before mandating adoption; address the AI-replacement fear directly and specifically in an all-hands setting; and track adoption metrics (daily active users) as seriously as business outcomes. Low adoption explains underperforming outcomes, and it's fixable early but expensive late.

See also:

The Honest Cost of AI Transformation: the infrastructure investment required at Stage 3 modeled against business case
Sequencing AI Patterns in a Multi-Year Roadmap: how to prioritize your second and third use cases after Pilot 1 succeeds

About the author

Victor Hoang

Co-Founder, Rework.com

Victor Hoang is Co-Founder and CMO of Rework. He spent 12+ years scaling B2B SaaS growth, building a lead engine that generated over 1 million leads and $10M+ in annual recurring revenue. Today he builds AI agents and MCP servers into Rework's products to empower customers across growth and operations. He writes about what actually works.

View full profile LinkedIn