Evaluating AI-Enabled SaaS: What's Real, What's Marketing
Key Facts: The AI-Washed SaaS Market
- Gartner predicts that by 2026, over 80% of enterprises will have used generative AI APIs or deployed applications, but a majority of "AI SaaS" products ship as thin wrappers over third-party foundation models rather than proprietary capability.
- MMC Ventures' landmark audit of European "AI startups" found roughly 40% showed no material evidence of AI in their product — the earliest formal measurement of AI-washing, and the gap has not closed in consumer SaaS.
- McKinsey's State of AI surveys show the average enterprise is still realizing most AI value from a small handful of use cases (coding, marketing content, customer ops), not from the broad "AI-everywhere" claims in vendor marketing.
- Stanford's AI Index documents that inference cost for GPT-3.5-class capability has dropped over 280x since late 2022, which is why so many "AI features" are now economically viable as a wrapper — and why the wrapper is not, by itself, defensibility.
- OpenAI, Anthropic, and Google foundation model APIs collectively power the overwhelming majority of AI features shipped in mid-market SaaS; the vendor's differentiator is typically the data pipeline and UX, not the model.
The VP of Operations had done everything right. She'd watched the demo three times. She'd checked references. She'd negotiated a reasonable contract. And six months after go-live, the "AI-powered automation" that had been the centerpiece of the pitch was used by approximately four people, generated outputs that required human review in ninety percent of cases, and had turned out to be — when the VP finally asked a developer to look under the hood — a GPT-4 API call with a custom prompt, wrapped in a nice UI.
Not a lie, exactly. GPT-4 was genuinely powering it. But calling a thin wrapper over a foundation model "AI automation" is about as accurate as calling a pizza delivered by car "automotive food delivery."
The AI SaaS marketing problem is this: "AI" has become a feature marketing label applied to anything from genuine model integration and proprietary training to a chatbot on a help page. The capability spectrum is enormous, and the marketing language doesn't differentiate between them. Gartner's AI hype cycle research tracks which AI capabilities have crossed from inflated expectations into productive deployment, a useful calibration for understanding whether any vendor's claimed capability is in production-ready territory or still ascending the hype slope. Every vendor has the word "AI" on their homepage. Almost none of them explain what their AI actually does, what it's trained on, or how it performs on your data specifically.
This guide is the evaluation framework that separates what's real from what isn't.
The AI Capture Test
The AI Capture Test is a three-part diagnostic for separating real AI capability from marketing veneer:
- If you removed the AI feature today, would the product still function and deliver its core value? If yes, the AI is a feature, not the product.
- Can the vendor explain what their system does that a direct OpenAI or Anthropic API call plus a competent prompt could not do? If they cannot, you are paying a wrapper premium.
- Does accuracy measurably improve on your data over time through fine-tuning, retrieval, or feedback loops the vendor controls? If not, the product is captured by its underlying foundation model and inherits all of its ceilings and failure modes.
The Capability Spectrum
Before evaluating any AI-enabled tool, understand where it sits on the capability spectrum:
Level 1: AI-branded features. Existing features (search, sorting, filtering, recommendations) relabeled with AI terminology. The underlying mechanism is rules-based or heuristic, not model-driven. Common in older platforms that have added AI marketing without AI capability.
Level 2: Foundation model integration. The vendor has integrated a third-party foundation model (GPT-4, Claude, Gemini) via API. The AI capability is real, but it's primarily driven by the underlying model's general capability, not the vendor's proprietary training or fine-tuning. The vendor's value-add is the prompt engineering, the data pipeline, and the UX.
Level 3: Fine-tuned models. The vendor has taken a foundation model and fine-tuned it on domain-specific data, often data from their customer base. The model performs better on domain-specific tasks than a general model would, but the underlying architecture is still from a third party.
Level 4: Proprietary models. The vendor has developed and trained their own model architecture. This is rare and expensive. Most SaaS vendors claiming AI capability are at Level 2 or 3.
Level 5: Genuine AI-native architecture. The entire product is designed around AI inference: not a bolt-on feature, but a core architectural decision. The product would not function without the AI component.
Knowing which level you're evaluating changes how you assess claims, what questions you ask, and what risk you're accepting. For the governance and policy layer that should govern which AI SaaS tools your teams can deploy, the AI governance policy for departments is the internal complement to this vendor-side evaluation.
The Five-Question AI Evaluation Framework
Question 1: What Model Powers It, and Who Owns the Model?
This question separates Level 1-2 from Level 3-5 and reveals the vendor's actual AI investment.
What to ask:
- What AI model or models power your AI features?
- Did you build the model, fine-tune a foundation model, or call a foundation model API directly?
- If you're using a foundation model API (GPT, Claude, Gemini), what happens if that provider changes pricing, availability, or API terms?
- If you've fine-tuned a model, on what data was it trained?
Red flags:
- Vendor declines to identify the underlying model
- Vendor claims to have built a proprietary model but can't explain the architecture or training approach
- Vendor depends entirely on a single foundation model API with no fallback
What good answers look like: "We use [Foundation Model] via API for [specific features]. We've also fine-tuned a custom model for [specific domain task] trained on [anonymized, consented customer data]. Our AI infrastructure is multi-model, so we can swap the underlying model if the provider changes terms."
Question 2: What Data Does the AI Train On?
This is the most critical data governance question for AI-enabled tools, and it's the one most vendors are evasive about.
There are three data regimes to understand. The NIST AI Risk Management Framework provides a structured approach to categorizing how AI systems interact with input data, specifically the distinction between inference-time processing and training-time data use that governs your privacy exposure.
Inference only (your data used for output, not training): Your data goes in, you get an output, and nothing about that interaction updates the underlying model. Your data is processed but not retained for training. This is the standard for enterprise AI tools with strong data governance.
Shared training (your data used to improve the model for all customers): Your data (or signals derived from your data) is used to update the model that serves all of the vendor's customers. This is how many consumer AI tools work. It's inappropriate for business data without explicit consent and a clear privacy framework.
Isolated per-customer training: The vendor trains separate model instances per customer. Your data improves only your model. This is technically more expensive and operationally more complex, but it's increasingly offered as a premium option for data-sensitive customers.
What to ask:
- Is customer data used to train your AI models?
- If yes, is it shared model training or per-customer?
- Can customers opt out of training data contribution?
- What data, specifically, is used for training: raw inputs, derived signals, or something else?
- Where is this documented in the DPA or data processing addendum?
Question 3: What Does the AI Actually Do vs. What the Human Still Does?
AI demos tend to show the best case: the model outputs a perfect draft, the automation completes the workflow, the insight surfaces at exactly the right moment. The real workflow includes the failure cases, the review cycles, and the tasks the AI still can't do reliably.
What to ask:
- In a typical production workflow, what percentage of AI outputs does a human review before use?
- What does a user do when the AI output is wrong? What's the correction workflow?
- What are the known failure modes, the tasks where the AI consistently underperforms?
- Is the AI fully automating a workflow, or augmenting a workflow that humans still complete?
The "what does the human still do" question is the most revealing. If the honest answer is "humans review everything before it goes anywhere meaningful," you're looking at an AI-assisted workflow, not an AI-automated one. That may still be valuable, but it's a different product than what the demo implied. For context on how mid-market teams are actually integrating AI tools into their workflows, the AI tools stack for mid-market guide covers which categories are delivering consistent ROI and which are still maturing.
Question 4: How Is Accuracy Measured and Reported?
Accuracy claims in AI demos are almost always run on the vendor's test data, in optimal conditions, with cherry-picked examples. What you care about is accuracy on your data, in your workflow, with your edge cases. Stanford's AI Index Report documents the consistent gap between benchmark performance on curated test sets and real-world performance on production data. This is a structural problem across AI systems that vendor-controlled demos systematically obscure.
What to ask:
- How do you define and measure accuracy for your AI features?
- What's the accuracy rate on production data vs. test/demo data?
- How does accuracy change as the input data quality varies?
- Are accuracy benchmarks available from customers in our industry and use case?
- How has accuracy changed in the past six months?
What to watch for:
- Accuracy claims with no methodology (e.g., "95% accurate" with no definition of what constitutes a correct output)
- Accuracy measured on inputs that are cleaner or more structured than your actual data
- Accuracy numbers that haven't been measured against production customer data
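The methodology questions above can be made concrete before the POC starts. A minimal sketch, assuming you and the vendor have pre-agreed a definition of "correct" and labeled sample data at three quality tiers — `run_ai_feature`, `is_correct`'s exact-match rule, and the sample structure are all illustrative placeholders, not the vendor's actual interface:

```python
def is_correct(output: str, expected: str) -> bool:
    # Pre-agree this definition with the vendor. Exact match (used here)
    # is the strictest option; fuzzy matching or human judgment are
    # common alternatives for generative outputs.
    return output.strip().lower() == expected.strip().lower()

def accuracy_by_tier(samples, run_ai_feature):
    """samples: {tier: [(input, expected), ...]} for clean/typical/messy data.

    Returns the fraction of correct outputs per data-quality tier, so the
    degradation from clean to messy inputs is visible, not averaged away.
    """
    results = {}
    for tier, pairs in samples.items():
        correct = sum(is_correct(run_ai_feature(x), y) for x, y in pairs)
        results[tier] = correct / len(pairs)
    return results
```

Reporting one number per tier, rather than one blended accuracy figure, is the point: a vendor quoting "95% accurate" on clean inputs may degrade sharply on the messy tier, and this structure makes that visible.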
Question 5: What Happens When It's Wrong?
Every AI system produces errors. The question is whether the product is designed to surface errors gracefully, whether errors are contained, and whether the vendor takes responsibility for downstream consequences.
What to ask:
- How does the product surface low-confidence outputs to users?
- Is there an audit log of AI-generated decisions or outputs?
- What's the escalation path when an AI error causes a downstream problem?
- What's in the contract regarding liability for errors in AI outputs?
- How do customers report systematic errors, and how quickly are they addressed?
The AI Capability Evaluation Scorecard (20 Criteria)
Score each criterion 1-5 (100 points possible). A total score below 60 suggests the AI claims are primarily marketing.
Model and Architecture (max 20)
- Underlying model clearly identified (1-5)
- Model architecture appropriate for the use case (1-5)
- Vendor has meaningful proprietary value-add beyond API call (1-5)
- Multi-model resilience (not single-point-of-failure on one provider) (1-5)
Data Governance (max 20)
- Customer data not used for shared model training (or clear opt-out) (1-5)
- DPA covers AI-specific data handling explicitly (1-5)
- Data residency and processing location confirmed (1-5)
- Data deletion process post-termination confirmed for AI-derived data (1-5)
Performance and Reliability (max 20)
- Production accuracy rate documented with clear methodology (1-5)
- Failure modes identified and communicated (1-5)
- Low-confidence output surfacing built into the UX (1-5)
- Accuracy on customer's actual data testable in POC (1-5)
Workflow Integration (max 20)
- AI automates meaningful portions of the workflow (not just a sidebar suggestion) (1-5)
- Human review points in the workflow are clearly designed (1-5)
- Escalation path for AI errors is documented (1-5)
- Audit trail of AI decisions is available (1-5)
Roadmap and Maturity (max 20)
- AI features are in production (not promised roadmap items) (1-5)
- Accuracy improvement trajectory over the past 6 months (1-5)
- AI development team and expertise visible (1-5)
- Customer references specifically for AI feature use (1-5)
Score interpretation:
- 80-100: Credible AI capability; proceed with POC
- 60-79: Partial AI capability; clarify gaps before committing
- 40-59: AI claims are primarily marketing; validate carefully before purchase
- Below 40: AI is superficial or rebranded; evaluate on non-AI merits only
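The scorecard arithmetic is simple enough to automate across a vendor shortlist. A minimal sketch of the sum-and-band logic — criterion keys are whatever labels you use for the 20 criteria, and the band text mirrors the interpretation above:

```python
def interpret(scores: dict) -> tuple:
    """scores: {criterion_name: 1-5 rating} for all 20 criteria.

    Returns (total, verdict) using the interpretation bands.
    """
    assert all(1 <= s <= 5 for s in scores.values()), "each criterion is 1-5"
    total = sum(scores.values())
    if total >= 80:
        verdict = "Credible AI capability; proceed with POC"
    elif total >= 60:
        verdict = "Partial AI capability; clarify gaps before committing"
    elif total >= 40:
        verdict = "AI claims are primarily marketing; validate carefully before purchase"
    else:
        verdict = "AI is superficial or rebranded; evaluate on non-AI merits only"
    return total, verdict
```

Keeping the per-criterion scores (not just the total) matters when comparing vendors: two products can both score 65 with completely different gap profiles, and the sub-section totals tell you where the risk sits.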
The 15-Question Data Processing Questionnaire for AI Vendors
Send this before any contract discussion that includes AI features:
- What AI models or technologies power your AI features?
- Did you build, fine-tune, or API-integrate the underlying model?
- Is customer data used to train, improve, or update any AI models?
- If yes, is this shared across customers or isolated per customer?
- Can customers opt out of AI training data contribution?
- Where is the AI model running: on your infrastructure, a cloud provider, or the foundation model provider's infrastructure?
- What customer data specifically is processed by the AI? (inputs, metadata, derived signals?)
- How is AI-processed data handled differently from non-AI data in your privacy framework?
- Is there a specific AI data processing addendum to your DPA?
- Where is AI-processed data stored geographically?
- How is AI-generated output attributed in audit logs?
- What happens to AI-derived data when the customer contract ends?
- What are the known accuracy limitations of your AI features?
- What liability does the vendor accept for errors in AI-generated outputs?
- Can we run a 30-day POC on our own data with pre-agreed accuracy benchmarks?
The 30-Day AI Pilot Design Template
The best way to evaluate AI capability is a structured proof of concept on your own data.
Pre-POC setup (Week 0):
- Define the specific workflow the AI is meant to improve
- Document the baseline (current state without AI; see measuring SaaS ROI 90 days after purchase)
- Set pre-agreed success metrics: accuracy rate, time savings, human review rate
- Confirm data requirements for the POC environment
Week 1-2: Controlled testing
- Run the AI feature on a representative sample of your data
- Measure accuracy against your pre-agreed definition
- Document failure cases and review rate
Week 3: Edge case testing
- Deliberately test with messy, incomplete, or edge-case inputs
- Measure how accuracy degrades
- Document whether the product surfaces low-confidence outputs appropriately
Week 4: Workflow integration
- Test the AI feature in a simulated production workflow
- Measure actual time savings (not estimated)
- Get feedback from two or three team members who'd use it daily
POC success gate: If the AI feature meets your pre-agreed accuracy threshold and time savings target, you have evidence to support a purchase decision. If it doesn't, you have evidence to either renegotiate the scope or decline.
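The success gate works best when the thresholds are written down in Week 0 and checked mechanically in Week 4, so the decision can't drift toward whatever the pilot happened to produce. A minimal sketch — the metric names and threshold values here are illustrative examples, not prescribed targets:

```python
# Floors: measured value must meet or exceed the pre-agreed minimum.
THRESHOLDS = {
    "accuracy": 0.90,             # fraction correct on typical data
    "time_saved_hours_week": 4.0, # measured during Week 4, not estimated
}
# Ceilings: measured value must not exceed the pre-agreed maximum.
CEILINGS = {
    "human_review_rate": 0.30,    # at most 30% of outputs need review
}

def poc_gate(measured: dict) -> bool:
    """True if the pilot meets every pre-agreed floor and ceiling."""
    meets_floors = all(measured[k] >= v for k, v in THRESHOLDS.items())
    meets_ceilings = all(measured[k] <= v for k, v in CEILINGS.items())
    return meets_floors and meets_ceilings
```

A failed gate is still useful evidence: a pilot that hits the accuracy floor but misses the review-rate ceiling supports renegotiating scope (an AI-assisted tier at a lower price) rather than declining outright.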
AI Marketing Terms Decoded
| Vendor Says | What It Often Means |
|---|---|
| "AI-powered" | At least one AI API call is in the product |
| "Machine learning driven" | Rules-based system with some statistical component |
| "Proprietary AI" | May be a fine-tuned version of a public model, not a built-from-scratch system |
| "Trained on billions of data points" | Uses a foundation model trained on public data |
| "Industry-specific AI" | Fine-tuned on some domain data; amount and quality unspecified |
| "Intelligent automation" | Automation with some conditional logic |
| "AI assistant" | Chatbot, often GPT-based with a custom prompt |
| "Predictive insights" | Statistical forecasting, accuracy varies widely |
| "Real-time AI" | API calls made during the user session, not pre-computed |
| "No-hallucination guarantee" | Retrieval-augmented generation (RAG) system; reduces but doesn't eliminate hallucination |
How Rework Thinks About AI Features
Rework ships AI features that augment the buyer's work, not replace the buyer's judgment. Inside Rework CRM and Sales Ops (from $12/user/month), AI drafts follow-up emails, summarizes deal history, and surfaces stalled pipeline — but a human always reviews and sends, because sales trust is a human contract. Inside Rework Work Ops (from $6/user/month), AI classifies incoming tasks, proposes assignees based on workload, and drafts status updates — humans still approve and own the outcome. We are transparent about the model layer: we use foundation models via API, we document what data goes to inference (and do not use customer data to train shared models), and we measure accuracy on customer data during onboarding rather than quoting demo-set benchmarks. Our posture is that the AI Capture Test applies to us too — and we'd rather ship fewer, honest AI features than plaster "AI-powered" across a feature list that would work the same without it.
Frequently Asked Questions About Evaluating AI-Enabled SaaS
How do I tell real AI from AI-washed marketing?
Apply the AI Capture Test: remove the AI feature and see whether the product still delivers its core value, ask what the vendor does beyond a foundation model API call, and verify whether accuracy improves on your data over time. If a vendor cannot answer those three questions concretely, you are almost certainly looking at marketing rather than capability. The MMC Ventures audit that found 40% of "AI startups" had no material AI in their product relied on essentially the same three checks.
What are the red flags in an AI SaaS demo?
Red flags include refusing to identify the underlying model, accuracy claims without a methodology or test dataset description, demos run only on vendor-prepared data, and the phrase "proprietary AI" with no explanation of architecture or training approach. Another common red flag is an AI feature that silently calls OpenAI or Anthropic but is priced as if the vendor built the model — you are paying a wrapper premium for something your own team could prototype in a week.
Should I pay more for AI features?
Pay more only when the AI is measurably doing work a human would otherwise do, on your data, at acceptable accuracy. Run a 30-day pilot with pre-agreed accuracy thresholds and time-savings targets before accepting the AI premium. If the feature is a GPT-4 API call with a prompt, remember the underlying inference cost has dropped more than 280x since 2022 per the Stanford AI Index — the wrapper itself is not worth much unless the data pipeline, fine-tuning, or UX materially changes the outcome.
What's the difference between a wrapper on GPT and a defensible AI product?
A wrapper sends your input to a foundation model with a system prompt and returns the output; anyone with an API key can build one. A defensible AI product adds proprietary training data, fine-tuned or custom models, retrieval systems built on the customer's own data, feedback loops that improve accuracy per customer, and a workflow integration that is expensive to replicate. The test is whether a competent engineering team could rebuild the wrapper in two weeks. If yes, it is not defensible.
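To make the "two weeks to rebuild" test vivid, here is roughly the entire engineering content of a thin wrapper. This is a hedged sketch, not any specific vendor's code: `call_model` is a hypothetical stand-in for whichever foundation-model API the product uses (OpenAI, Anthropic, Google), and the prompt is invented for illustration:

```python
# The whole product, minus the UI: a system prompt, one model call,
# and light post-processing. Everything else the demo showed was the
# foundation model's general capability.
SYSTEM_PROMPT = (
    "You are a deal-summary assistant. Summarize the deal history below "
    "in three bullet points for a sales manager."
)

def ai_feature(user_input: str, call_model) -> str:
    # call_model is a placeholder for the provider API call.
    raw = call_model(f"{SYSTEM_PROMPT}\n\n{user_input}")
    return raw.strip()  # the wrapper's entire "value-add": formatting
```

None of the defensible ingredients appear here: no proprietary training data, no retrieval over the customer's own records, no feedback loop that improves accuracy per customer. If the vendor's answer to Question 1 reduces to something like this, price it as a convenience layer, not as AI capability.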
How do I evaluate AI accuracy before buying?
Require a 30-day proof of concept on your own data with pre-agreed accuracy metrics, sample size, and a clear definition of what counts as a correct output. Do not accept vendor-reported benchmarks from curated test sets — the Stanford AI Index documents a consistent gap between benchmark accuracy and production accuracy across AI systems. Measure accuracy at three data quality levels (clean, typical, messy) to see how the system degrades under realistic conditions.
What data risks are unique to AI-enabled SaaS?
AI tools introduce three risks that non-AI SaaS does not: customer data being used to train shared models that serve competitors, inference-time data being logged or retained by the foundation model provider outside your DPA, and AI-generated outputs that cannot be audited or explained when they go wrong. Mitigate by requiring an AI-specific DPA addendum, confirming in writing that your data is used for inference only (not training), and mandating an audit log of AI-generated decisions so errors can be traced and corrected.
Learn More
- The Pre-Purchase Vendor Diligence Checklist for Mid-Market Buyers: how AI evaluation fits into the broader diligence framework
- Security and Compliance Review: What a Mid-Market Buyer Should Actually Check: the expanded security layer for AI tools
- SOC 2, ISO 27001, and GDPR for Buyers: What Each Actually Covers: GDPR DPA requirements specific to AI data processing
- SaaS Contract Red Flags: Auto-Renewal, Usage Caps, and Termination Clauses to Watch: AI-specific contract clauses to watch for
- AI readiness assessment templates: how to evaluate your organization's readiness to operationalize AI SaaS before purchasing
- Measuring SaaS ROI 90 days after purchase: how to set up baseline measurement before deploying AI tools so ROI claims are verifiable
