Running AI Pilot Programs: A Step-by-Step Guide for Department Leaders
A Sales Ops Manager at a mid-size SaaS company ran the same AI pilot twice. The first time, she put together a 6-week trial with 8 reps, let them use the tool however they wanted, and collected feedback at the end. The results were mixed. Some reps liked it. Some didn't. No before/after data. No defined success criteria. The conclusion: "Let's revisit next quarter."
The second time, she started with a single workflow: how long reps spent on post-call CRM updates. She measured baseline first: 47 minutes per rep per day, averaged across 8 reps over two weeks. Then she ran the pilot with the same 8 reps, measuring the same metric every week. At week 6, post-call CRM update time averaged 11 minutes. She had her decision in 6 weeks, presented it to her VP and CFO in 15 minutes, and got approval for a full rollout in the same meeting.
The difference wasn't the tool. It was the design.
Most AI pilots fail to generate a decision. They run for several weeks, produce mixed anecdotal feedback, and end with "let's revisit." The problem isn't the AI. It's that pilots without success criteria can only produce inconclusive results. Harvard Business Review research on technology pilots found that the single biggest differentiator between successful and unsuccessful enterprise AI initiatives was whether success criteria were defined before the project began, not after data was collected. You'll run the same pilot again in six months unless you change how you design it. Before committing to any pilot, run the AI readiness assessment first — it tells you whether your data and process foundations can support a fair test.
What Makes an AI Pilot Different from an IT Trial
This distinction matters before you start. An IT trial answers: does the tool work technically? Does it integrate, does it perform, does it meet security requirements? That's a vendor's job to prove, often through a free trial period.
An AI pilot answers a different question: does this tool produce measurable business value for our team, in our context, in our workflows?
Those are separate evaluations, and they require different designs. IT trials are pass/fail technical assessments. AI pilots are business case validations. You need both, but they shouldn't be the same activity.
Common mistake: treating the vendor's free trial as the pilot. Vendor trials are designed to get you to the capabilities demo as fast as possible, not to validate your specific workflow improvement hypothesis. The 30-day free trial period is when you run IT due diligence. The AI pilot is what you run after technical validation is complete.
Before You Start: Four Prerequisites
Don't launch a pilot until all four of these are in place. A missing prerequisite is the most common reason pilots produce inconclusive results.
1. A defined problem statement. Not "we want to explore AI tools." A specific workflow problem. "Reps spend too long on CRM updates after calls" is a problem statement. "We should look into AI" is not.
2. A measurable baseline. The metric you want to improve must have a current number attached to it before the pilot starts. If you don't have a baseline, your first two weeks of the pilot are spent establishing one, and you'll be tempted to start the clock before you're ready.
3. An executive sponsor. A pilot without a sponsor is a pilot that can die from a shift in priorities. Your sponsor doesn't need to be active day-to-day. They need to be committed enough to protect the pilot's timeline and unblock escalations when they happen.
4. A committed pilot team. Voluntary participation from people who will actually use the tool consistently during the pilot period. Reluctant participants produce noisy data. Consistent participants produce signal.
Step 1: Define the Pilot Scope and Hypothesis
A well-scoped pilot covers one workflow, one team, and one question.
Pilot Scope Template
Problem: [What specific workflow is slow, error-prone, or time-consuming?]
Hypothesis: If we use [tool/feature] for [specific workflow],
then [metric] will improve by [target] within [timeframe].
Success Metric: [Single primary metric. E.g., "time per CRM update,"
"content brief turnaround time," "weekly report generation time"]
Baseline: [Current measured value of the success metric]
Secondary Metrics: [2-3 supporting metrics. E.g., adoption rate,
user satisfaction score, output quality rating]
Timeline: [Start date → End date, typically 4-8 weeks]
Team: [Names and roles of pilot participants]
Exclusions: [What this pilot will NOT evaluate]
Fill in every field before the pilot starts. The exclusions section is underused but important: it prevents scope creep and gives you a clear answer when someone asks "but did you test it for X?"
A good hypothesis is falsifiable. "AI will help our team" is not a hypothesis. "Using AI meeting summaries will reduce post-meeting action item follow-up time from 25 minutes to under 10 minutes per meeting" is.
Step 2: Set Baseline Metrics Before Day One
You cannot measure improvement without a baseline. This sounds obvious, but most pilots skip it or defer it.
How to capture baseline data.
For time-based metrics: use a simple self-reporting log for one to two weeks before the pilot starts. Ask participants to track time spent on the specific task, once per day, for 10 business days. Average across the group.
For volume-based metrics: pull the historical average from your existing tools if the data is there. Two weeks of recent history is usually sufficient.
For quality-based metrics: have participants rate their current output quality on a 1-5 scale before the pilot. This is subjective, but the before/after comparison is still meaningful.
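To make the averaging concrete, here's a minimal Python sketch of the baseline calculation for a time-based metric. The participant names and logged values are hypothetical; the point is the order of operations.

```python
from statistics import mean

# Hypothetical self-reported logs: minutes spent on the target task,
# one entry per participant per business day, over 10 business days.
logs = {
    "rep_a": [52, 49, 45, 50, 48, 44, 51, 47, 46, 43],
    "rep_b": [40, 42, 38, 45, 41, 39, 44, 40, 43, 42],
    "rep_c": [55, 58, 50, 53, 57, 52, 54, 56, 51, 55],
}

# Average each participant first, then average across the group, so one
# over-reporter (or a rep with missing days) doesn't skew the cohort number.
per_participant = {rep: mean(minutes) for rep, minutes in logs.items()}
baseline = mean(per_participant.values())

print(f"Baseline: {baseline:.1f} minutes per participant per day")
```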
Common baseline metrics by department.
| Department | Workflow | Baseline Metric |
|---|---|---|
| Sales | Post-call CRM updates | Minutes per update per rep per day |
| Sales | Deal review preparation | Hours per manager per week |
| Marketing | Content brief creation | Hours per brief |
| Marketing | Weekly campaign report | Hours from data pull to final report |
| Operations | Weekly status reporting | Hours per report per manager |
| Customer Success | Call summary and follow-up | Minutes per customer interaction |
| HR | Job description drafting | Hours per JD from request to final |
Pick the metric that represents the most time-consuming or error-prone part of the workflow you're targeting. Secondary metrics matter, but the primary metric is what drives the go/no-go decision.
Step 3: Select Pilot Participants
The ideal pilot cohort size is 5-12 people. A smaller cohort produces insufficient signal; a larger one makes the controlled environment hard to maintain. The change management playbook for AI rollout covers the emotional layer of participant selection in more depth — specifically why skeptics in the cohort are not optional and how to frame the invitation to reluctant participants.
Cohort composition.
Include 3-5 early adopters: people who have used similar tools before, who responded positively to the concept, or who volunteered. These participants will adopt quickly and establish best practices that you can spread to the rest of the cohort.
Include 2-3 solid mid-performers: people who are competent and consistent but not enthusiasts. They represent the average experience and produce the most reliable baseline comparisons.
Include 1-2 skeptics: people who expressed doubts, who have more to lose from workflow disruption, or who were explicitly unenthusiastic. This is not optional.
Why skeptics are not optional.
When a skeptic adopts and reports positive results, the rest of the team believes it. Adoption is a social process. People don't evaluate tools in isolation. They watch what their peers experience. MIT Sloan research on workplace technology adoption documents this phenomenon specifically: peer validation from credible skeptics is more influential on broader team adoption than any formal training or executive sponsorship. If your pilot cohort contains only enthusiasts, your report will be dismissed as selection bias, because it is.
Ask your skeptic directly: "I want to include you in this pilot specifically because I know you have reservations. Your perspective will make the results more credible. Are you willing to commit to using the tool consistently for six weeks and giving us honest feedback?" Most people say yes when asked that way.
Before finalizing the cohort: confirm that each participant can commit to the pilot timeline without major interruptions (vacations, project crunches, role changes). One week of absence from a 6-week pilot distorts that person's data significantly.
Step 4: Design the Pilot Timeline
A 6-week pilot is the right default for most AI workflow tools. Four weeks is too short to distinguish early adopter behavior from sustained habit. Eight weeks risks losing urgency and participant engagement.
6-Week Pilot Calendar Template
| Week | Objective | Activities | Data Collected |
|---|---|---|---|
| Week 1 | Onboarding and first use | Kickoff session (90 min), tool setup, first task completion | Tool login confirmation, first use date |
| Week 2 | Habit formation | Individual use in target workflow, daily log | Weekly time log, adoption rate |
| Week 3 | Expand usage | Apply to secondary use cases identified by participants | Weekly time log, qualitative feedback |
| Week 4 | Troubleshoot blockers | Weekly check-in, address friction points, champion peer coaching | Blocker log, satisfaction score |
| Week 5 | Volume and consistency | Full workflow integration | Weekly time log, output quality rating |
| Week 6 | Measurement and readout | Final data collection, participant survey, results analysis | Final metrics vs. baseline, NPS, decision recommendation |
Note the check-in points at the end of weeks 2 and 4. These are not optional reviews. They're when you catch participation drop-off before it's too late to address it.
Step 5: Run a Structured Kickoff Session
The kickoff session sets the behavioral frame for the entire pilot. A poorly run kickoff produces inconsistent participation and inconsistent data. Keep it to 90 minutes.
90-Minute Kickoff Agenda
| Time | Topic | Who Runs It |
|---|---|---|
| 0:00-0:10 | Why this pilot, why now, what we're testing (context) | Pilot lead |
| 0:10-0:25 | Live tool demo focused on the target workflow only | Pilot lead or vendor |
| 0:25-0:45 | Hands-on setup: every participant logs in and completes one task | All participants |
| 0:45-1:00 | Baseline logging instructions: how to fill in the weekly log | Pilot lead |
| 1:00-1:10 | Q&A: only questions about how to use the tool or log data | All |
| 1:10-1:20 | Pilot norms: how to flag blockers, when check-ins are, who to contact | Pilot lead |
| 1:20-1:30 | Buffer and individual setup help | All |
Two things to skip in the kickoff: extended feature demos of things you're not testing, and open-ended discussion about whether AI is good or bad. Save those conversations for the retrospective.
Every participant should leave the kickoff with tool access confirmed, at least one task completed, and a clear understanding of how to log their weekly data.
Step 6: Collect Data Weekly, Not Just at the End
End-of-pilot surveys produce recall bias. People remember the last two weeks, not the first four. Weekly data collection throughout the pilot is more accurate and more useful.
Weekly Pilot Log Template
Send this to each participant every Friday during the pilot:
Week [N] Pilot Check-In
1. How many times did you use [tool] this week for [target workflow]?
□ 0 □ 1-2 □ 3-5 □ 6+
2. Estimated time spent on [target workflow] this week (total hours/minutes):
___________
3. Any blockers or friction points this week? (Brief description or "none")
___________
4. What worked well this week? (Optional but encouraged)
___________
5. Satisfaction with [tool] this week: 1 (very dissatisfied) to 5 (very satisfied)
___________
Keep it to 5 questions and under 3 minutes to complete. If it takes longer, people stop doing it. Use the responses to catch participation drop-off in week 2 or 3, not after the pilot ends.
Review responses within 24 hours of receiving them. If someone logs 0 uses in week 2, follow up directly. Don't wait until week 4 to discover that half your cohort stopped using the tool.
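A minimal sketch of that weekly triage, assuming you export the Friday responses as a list of records. The field names are illustrative, not any particular survey tool's schema.

```python
# Hypothetical check-in responses for one week of the pilot.
responses = [
    {"participant": "rep_a", "uses": 4, "satisfaction": 4},
    {"participant": "rep_b", "uses": 0, "satisfaction": 2},
    {"participant": "rep_c", "uses": 1, "satisfaction": 3},
]

# Flag anyone who logged zero uses or low satisfaction for a direct
# follow-up within 24 hours, before drop-off hardens into non-adoption.
follow_ups = [
    r["participant"] for r in responses
    if r["uses"] == 0 or r["satisfaction"] <= 2
]

print("Follow up directly:", follow_ups)
```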
Step 7: Analyze Results Against Baseline
At the end of week 6, you have six weeks of weekly logs plus a baseline measurement. Analysis is straightforward.
Time saved calculation:
Weekly time saved = (Baseline time per week) - (Pilot week average time per week)
Annual time saved per person = Weekly time saved x 48 working weeks
Team annual time saved = Annual per person x cohort size
Adoption rate:
Adoption rate = (Participants with 3+ uses per week in weeks 4-6) / (Total participants)
Use weeks 4-6, not all 6 weeks. Weeks 1-3 include the learning curve. The sustainable adoption number is what weeks 4-6 show.
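Here's the full calculation as a short Python sketch, reusing the CRM-update numbers from the opening example; the per-rep usage counts are hypothetical.

```python
# Inputs: baseline and weeks 4-6 averages from the opening CRM example.
baseline_minutes_per_day = 47
pilot_minutes_per_day = 11          # average across weeks 4-6
cohort_size = 8
working_weeks = 48

# Time saved, following the formulas above (5 working days per week).
weekly_saved_minutes = (baseline_minutes_per_day - pilot_minutes_per_day) * 5
annual_saved_hours = weekly_saved_minutes * working_weeks / 60
team_annual_saved_hours = annual_saved_hours * cohort_size

# Adoption rate: participants averaging 3+ uses/week in weeks 4-6 only.
avg_uses_weeks_4_to_6 = [6, 5, 4, 3, 5, 2, 4, 1]    # hypothetical, one per rep
adopters = sum(1 for uses in avg_uses_weeks_4_to_6 if uses >= 3)
adoption_rate = adopters / cohort_size

print(f"~{annual_saved_hours:.0f} hours/person/year, "
      f"~{team_annual_saved_hours:.0f} hours/team/year, "
      f"adoption {adoption_rate:.0%}")
```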
What "good enough" looks like for a go decision.
There's no universal threshold, but these guidelines hold for most workflow AI tools. They're informed by Deloitte's research on AI implementation, which found that initiatives showing less than 20% improvement in their primary workflow metric within the first 60 days rarely recovered to meaningful ROI at 12 months:
- Primary metric improves by at least 20% compared to baseline
- Adoption rate in weeks 4-6 is at least 60% of cohort
- Average satisfaction score is at least 3.5 out of 5
- No critical technical blockers remain unresolved
If all four are met, you have a go signal. If two or fewer are met, you have a no-go signal. If three are met and one is borderline, you have grounds for an extension.
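Expressed as a quick check, this is a sketch of the decision logic; the thresholds are the guideline values above, not universal constants.

```python
def pilot_decision(improvement, adoption_rate, satisfaction, open_blockers):
    """Map the four guideline criteria to a go / extend / no-go signal."""
    met = sum([
        improvement >= 0.20,      # primary metric vs. baseline
        adoption_rate >= 0.60,    # weeks 4-6 cohort adoption
        satisfaction >= 3.5,      # average score out of 5
        open_blockers == 0,       # no unresolved critical blockers
    ])
    if met == 4:
        return "go"
    if met == 3:
        return "extend"           # grounds for extension if the miss is borderline
    return "no-go"                # two or fewer criteria met

# Hypothetical week-6 numbers: 77% improvement, 75% adoption, 4.1/5, no blockers.
print(pilot_decision(0.77, 0.75, 4.1, 0))   # -> "go"
```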
Step 8: Write the Pilot Readout
The readout document is what you present to finance, IT, and leadership. It should be one to two pages. Longer documents produce more questions, not more confidence. If you're building toward a budget request from the pilot results, see the AI training budget business case guide — it has the three-scenario ROI model that turns pilot data into numbers a CFO will trust.
Pilot Readout Document Template
EXECUTIVE SUMMARY
[2-3 sentences: what we tested, what we found, what we recommend]
PILOT SCOPE
Tool: [Name and function]
Workflow tested: [Specific workflow]
Team: [Roles, not names]
Timeline: [Start → End]
METRICS VS. BASELINE
| Metric | Baseline | Pilot Average (Weeks 4-6) | Change |
|---|---|---|---|
| [Primary metric] | [Value] | [Value] | [%] |
| Adoption rate | 0% | [%] | — |
| Satisfaction score | — | [X]/5 | — |
TEAM FEEDBACK
"[Quote from early adopter]"
"[Quote from skeptic — include this one especially]"
"[Quote from a mid-performer]"
WHAT DIDN'T WORK
[Honest description of friction points, integration issues, or use cases that underperformed]
RISKS
[2-3 risks for full rollout and how you'd address them]
RECOMMENDATION
□ Go — proceed to full rollout
□ Extend — re-test with adjustments [describe adjustments]
□ No-go — do not proceed [describe what would need to change]
If go: Estimated rollout timeline and resource requirements
If no-go: Conditions under which we'd re-evaluate
The "What Didn't Work" section is not optional. Readouts without honest friction point documentation read as sales documents, not evidence. Finance and IT will discount them. Include the problems and your plan to address them.
Go/No-Go Decision Framework
Three questions determine the decision.
Question 1: Did the primary metric improve by at least 20%? If no, the tool doesn't solve the problem you identified. No-go.
Question 2: Will adoption hold at scale, or was this cohort unusually motivated? Evaluate this by looking at your skeptic's data. If your most reluctant participant adopted and reported improvement, adoption at scale is plausible. If only your early adopters adopted, you have a motivation problem, not a tool problem.
Question 3: Are the unresolved blockers fixable before rollout? List every blocker from the weekly logs. Categorize each as: (a) already resolved, (b) resolvable before rollout with a clear owner, or (c) not resolvable in the current tool/configuration. If category (c) blockers affect more than 20% of your proposed rollout scope, extend the pilot or return to vendor evaluation.
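To illustrate the category (c) test, a sketch with hypothetical blocker entries; the `affected` field is your estimate of how many people in the proposed rollout each blocker would hit.

```python
# Hypothetical blockers pulled from the weekly logs.
blockers = [
    {"issue": "SSO fails on mobile",      "category": "a", "affected": 2},
    {"issue": "CRM field mapping gaps",   "category": "b", "affected": 5},
    {"issue": "No support for EU region", "category": "c", "affected": 12},
]
rollout_scope = 40   # hypothetical headcount for the full rollout

# Category (c) = not resolvable in the current tool or configuration.
affected_by_c = sum(b["affected"] for b in blockers if b["category"] == "c")

if affected_by_c / rollout_scope > 0.20:
    print("Extend the pilot or return to vendor evaluation")
else:
    print("Remaining blockers are manageable for rollout")
```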
When to extend vs. decide vs. kill.
Extend when: you have strong signal on the primary metric but low adoption due to a specific fixable blocker. Add 2-3 weeks, fix the blocker, and re-measure.
Decide when: your three questions produce consistent answers, positive or negative.
Kill when: you have at least two consecutive pilots producing the same inconclusive results on the same blockers. This means the tool doesn't fit your context, not that you need a better-designed pilot.
Common Pitfalls
Pilots with no control group. If everyone on the team uses the tool, you have no comparison baseline for "what would have happened without it." For your primary metric, try to keep a small group not using the tool so you have a counterfactual. Even 2-3 people in a non-pilot condition helps.
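A sketch of the comparison that even a small control group enables. All numbers are hypothetical except the 47/11 pilot figures from the opening example.

```python
from statistics import mean

# Minutes per day on the target workflow, measured identically for both groups.
pilot_baseline, pilot_week6 = 47, 11          # 8 reps using the tool
control_baseline = [46, 49, 44]               # 3 reps without it
control_week6 = [43, 47, 41]

# Difference-in-differences: subtract the change the control group saw
# anyway (seasonality, easier accounts, general ramp) from the pilot change.
pilot_change = pilot_baseline - pilot_week6
control_change = mean(control_baseline) - mean(control_week6)
attributable = pilot_change - control_change

print(f"Improvement attributable to the tool: ~{attributable:.0f} min/day")
```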
Success criteria set after the fact. Defining success after you see the results is not piloting. It's rationalization. Criteria set after the fact will always support whatever outcome is politically convenient. Write them down and lock them before week 1 starts.
No escalation path for blockers during the pilot. When a technical blocker hits in week 2 and nobody knows who owns it, participation drops and data quality degrades. Before kickoff, assign a single owner for technical escalations and a response SLA (24 hours is reasonable).
What to Do Next
If go: use your pilot documentation as the rollout blueprint. The workflow, training approach, and success metrics you validated in the pilot become the template for the full rollout. Don't redesign from scratch. You already have evidence for what works. The AI tools stack guide has the 6-month rollout sequence that shows how pilot results from Layer 2 tools feed into Layer 3 analytics readiness decisions.
If no-go: document what would need to change before you'd re-test. Specifically: which metric would need to improve, what technical blocker would need to be resolved, or what workflow change would need to happen first. File this and revisit in one quarter. If the conditions haven't changed, don't re-pilot.
Either way, share the pilot readout with your team. People who participated in a pilot that produced a no-go decision need to know the result, the reasoning, and what it means for their workflow going forward. Silence after a no-go is how you lose credibility for the next pilot.
Related guides:
- Change Management Playbook for AI Rollout
- AI Tools Stack for Mid-Market Teams: CRM, Productivity, Analytics
- Measuring AI Adoption ROI Across Your Team
- AI Training Budget: How to Make the Business Case
- Upskill vs. Hire AI-Native: The ROI Case
- AI Skills Gap: What Executives Are Getting Wrong
- Bootcamp vs. University AI Talent Pipeline in 2026
Learn More: How to Run an AI Proof of Concept That Finance Will Fund
