
Evaluating AI-Enabled SaaS: What's Real, What's Marketing

The VP of Operations had done everything right. She'd watched the demo three times. She'd checked references. She'd negotiated a reasonable contract. And six months after go-live, the "AI-powered automation" that had been the centerpiece of the pitch was used by approximately four people, generated outputs that required human review in ninety percent of cases, and had turned out to be — when the VP finally asked a developer to look under the hood — a GPT-4 API call with a custom prompt, wrapped in a nice UI.

Not a lie, exactly. GPT-4 was genuinely powering it. But calling a thin wrapper over a foundation model "AI automation" is about as accurate as calling a pizza delivered by car "automotive food delivery."

The AI SaaS marketing problem is this: "AI" has become a feature marketing label applied to anything from genuine model integration and proprietary training to a chatbot on a help page. The capability spectrum is enormous, and the marketing language doesn't differentiate between them. Gartner's AI hype cycle research tracks which AI capabilities have crossed from inflated expectations into productive deployment — a useful calibration for understanding whether any vendor's claimed capability is in production-ready territory or still ascending the hype slope. Every vendor has the word "AI" on their homepage. Almost none of them explain what their AI actually does, what it's trained on, or how it performs on your data specifically.

This guide is the evaluation framework that separates what's real from what isn't.

The Capability Spectrum

Before evaluating any AI-enabled tool, understand where it sits on the capability spectrum:

Level 1: AI-branded features. Existing features (search, sorting, filtering, recommendations) relabeled with AI terminology. The underlying mechanism is rules-based or heuristic, not model-driven. Common in older platforms that have added AI marketing without AI capability.

Level 2: Foundation model integration. The vendor has integrated a third-party foundation model (GPT-4, Claude, Gemini) via API. The AI capability is real, but it's primarily driven by the underlying model's general capability, not the vendor's proprietary training or fine-tuning. The vendor's value-add is the prompt engineering, the data pipeline, and the UX.

Level 3: Fine-tuned models. The vendor has taken a foundation model and fine-tuned it on domain-specific data, often data from their customer base. The model performs better on domain-specific tasks than a general model would, but the underlying architecture is still from a third party.

Level 4: Proprietary models. The vendor has developed and trained their own model architecture. This is rare and expensive. Most SaaS vendors claiming AI capability are at Level 2 or 3.

Level 5: Genuine AI-native architecture. The entire product is designed around AI inference: not a bolt-on feature, but a core architectural decision. The product would not function without the AI component.

Knowing which level you're evaluating changes how you assess claims, what questions you ask, and what risk you're accepting. For the governance and policy layer that should govern which AI SaaS tools your teams can deploy, the AI governance policy for departments is the internal complement to this vendor-side evaluation.

The Five-Question AI Evaluation Framework

Question 1: What Model Powers It, and Who Owns the Model?

This question separates Level 1-2 from Level 3-5 and reveals the vendor's actual AI investment.

What to ask:

  • What AI model or models power your AI features?
  • Did you build the model, fine-tune a foundation model, or call a foundation model API directly?
  • If you're using a foundation model API (GPT, Claude, Gemini), what happens if that provider changes pricing, availability, or API terms?
  • If you've fine-tuned a model, on what data was it trained?

Red flags:

  • Vendor declines to identify the underlying model
  • Vendor claims to have built a proprietary model but can't explain the architecture or training approach
  • Vendor depends entirely on a single foundation model API with no fallback

What good answers look like: "We use [Foundation Model] via API for [specific features]. We've also fine-tuned a custom model for [specific domain task] trained on [anonymized, consented customer data]. Our AI infrastructure is multi-model, so we can swap the underlying model if the provider changes terms."

Question 2: What Data Does the AI Train On?

This is the most critical data governance question for AI-enabled tools, and it's the one most vendors are evasive about.

There are three data regimes to understand. The NIST AI Risk Management Framework provides a structured approach to categorizing how AI systems interact with input data — specifically the distinction between inference-time processing and training-time data use that governs your privacy exposure.

Inference only (your data used for output, not training): Your data goes in, you get an output, and nothing about that interaction updates the underlying model. Your data is processed but not retained for training. This is the standard for enterprise AI tools with strong data governance.

Shared training (your data used to improve the model for all customers): Your data (or signals derived from your data) is used to update the model that serves all of the vendor's customers. This is how many consumer AI tools work. It's inappropriate for business data without explicit consent and a clear privacy framework.

Isolated per-customer training: The vendor trains separate model instances per customer. Your data improves only your model. This is technically more expensive and operationally more complex, but it's increasingly offered as a premium option for data-sensitive customers.

What to ask:

  • Is customer data used to train your AI models?
  • If yes, is it shared model training or per-customer?
  • Can customers opt out of training data contribution?
  • What data, specifically, is used for training: raw inputs, derived signals, or something else?
  • Where is this documented in the DPA or data processing addendum?

Question 3: What Does the AI Actually Do vs. What the Human Still Does?

AI demos tend to show the best case: the model outputs a perfect draft, the automation completes the workflow, the insight surfaces at exactly the right moment. The real workflow includes the failure cases, the review cycles, and the tasks the AI still can't do reliably.

What to ask:

  • In a typical production workflow, what percentage of AI outputs does a human review before use?
  • What does a user do when the AI output is wrong? What's the correction workflow?
  • What are the known failure modes, the tasks where the AI consistently underperforms?
  • Is the AI fully automating a workflow, or augmenting a workflow that humans still complete?

The "what does the human still do" question is the most revealing. If the honest answer is "humans review everything before it goes anywhere meaningful," you're looking at an AI-assisted workflow, not an AI-automated one. That may still be valuable, but it's a different product than what the demo implied. For context on how mid-market teams are actually integrating AI tools into their workflows, the AI tools stack for mid-market guide covers which categories are delivering consistent ROI and which are still maturing.

Question 4: How Is Accuracy Measured and Reported?

Accuracy claims in AI demos are almost always run on the vendor's test data, in optimal conditions, with cherry-picked examples. What you care about is accuracy on your data, in your workflow, with your edge cases. Stanford's AI Index Report documents the consistent gap between benchmark performance on curated test sets and real-world performance on production data — a structural problem across AI systems that vendor-controlled demos systematically obscure.

What to ask:

  • How do you define and measure accuracy for your AI features?
  • What's the accuracy rate on production data vs. test/demo data?
  • How does accuracy change as the input data quality varies?
  • Are accuracy benchmarks available from customers in our industry and use case?
  • How has accuracy changed in the past six months?

What to watch for:

  • Accuracy claims with no methodology (e.g., "95% accurate" with no definition of what constitutes a correct output)
  • Accuracy measured on inputs that are cleaner or more structured than your actual data
  • Accuracy numbers that haven't been measured against production customer data

Question 5: What Happens When It's Wrong?

Every AI system produces errors. The question is whether the product is designed to surface errors gracefully, whether errors are contained, and whether the vendor takes responsibility for downstream consequences.

What to ask:

  • How does the product surface low-confidence outputs to users?
  • Is there an audit log of AI-generated decisions or outputs?
  • What's the escalation path when an AI error causes a downstream problem?
  • What's in the contract regarding liability for errors in AI outputs?
  • How do customers report systematic errors, and how quickly are they addressed?

The AI Capability Evaluation Scorecard (20 Criteria)

Score each criterion 1-5, for a maximum of 100. A total score below 60 suggests the AI claims need careful validation before purchase; the interpretation bands below break this down further.

Model and Architecture (max 20)

  1. Underlying model clearly identified (1-5)
  2. Model architecture appropriate for the use case (1-5)
  3. Vendor has meaningful proprietary value-add beyond API call (1-5)
  4. Multi-model resilience (not single-point-of-failure on one provider) (1-5)

Data Governance (max 20)

  5. Customer data not used for shared model training (or clear opt-out) (1-5)
  6. DPA covers AI-specific data handling explicitly (1-5)
  7. Data residency and processing location confirmed (1-5)
  8. Data deletion process post-termination confirmed for AI-derived data (1-5)

Performance and Reliability (max 20)

  9. Production accuracy rate documented with clear methodology (1-5)
  10. Failure modes identified and communicated (1-5)
  11. Low-confidence output surfacing built into the UX (1-5)
  12. Accuracy on customer's actual data testable in POC (1-5)

Workflow Integration (max 20)

  13. AI automates meaningful portions of the workflow (not just a sidebar suggestion) (1-5)
  14. Human review points in the workflow are clearly designed (1-5)
  15. Escalation path for AI errors is documented (1-5)
  16. Audit trail of AI decisions is available (1-5)

Roadmap and Maturity (max 20)

  17. AI features are in production (not promised roadmap items) (1-5)
  18. Accuracy improvement trajectory over the past 6 months (1-5)
  19. AI development team and expertise visible (1-5)
  20. Customer references specifically for AI feature use (1-5)

Score interpretation:

  • 80-100: Credible AI capability; proceed with POC
  • 60-79: Partial AI capability; clarify gaps before committing
  • 40-59: AI claims are primarily marketing; validate carefully before purchase
  • Below 40: AI is superficial or rebranded; evaluate on non-AI merits only
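The scorecard arithmetic above can be sketched as a short script. This is a minimal illustrative helper, not part of any vendor tool: it validates twenty 1-5 scores, totals them, and maps the total to the interpretation bands.

```python
def interpret_scorecard(scores):
    """Total twenty 1-5 criterion scores and map the result to a band."""
    if len(scores) != 20:
        raise ValueError("The scorecard has exactly 20 criteria")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("Each criterion is scored 1-5")
    total = sum(scores)
    # Band boundaries follow the score interpretation table above.
    if total >= 80:
        band = "Credible AI capability; proceed with POC"
    elif total >= 60:
        band = "Partial AI capability; clarify gaps before committing"
    elif total >= 40:
        band = "AI claims are primarily marketing; validate carefully"
    else:
        band = "AI is superficial or rebranded; evaluate on non-AI merits only"
    return total, band

total, band = interpret_scorecard([4] * 20)
print(total, "->", band)  # 80 -> Credible AI capability; proceed with POC
```

A vendor scoring straight 4s lands exactly at the POC threshold; a vendor scoring straight 2s lands at 40, the top of the "primarily marketing" band.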

The 15-Question Data Processing Questionnaire for AI Vendors

Send this before any contract discussion that includes AI features:

  1. What AI models or technologies power your AI features?
  2. Did you build, fine-tune, or API-integrate the underlying model?
  3. Is customer data used to train, improve, or update any AI models?
  4. If yes, is this shared across customers or isolated per customer?
  5. Can customers opt out of AI training data contribution?
  6. Where is the AI model running: on your infrastructure, a cloud provider, or the foundation model provider's infrastructure?
  7. What customer data specifically is processed by the AI? (inputs, metadata, derived signals?)
  8. How is AI-processed data handled differently from non-AI data in your privacy framework?
  9. Is there a specific AI data processing addendum to your DPA?
  10. Where is AI-processed data stored geographically?
  11. How is AI-generated output attributed in audit logs?
  12. What happens to AI-derived data when the customer contract ends?
  13. What are the known accuracy limitations of your AI features?
  14. What liability does the vendor accept for errors in AI-generated outputs?
  15. Can we run a 30-day POC on our own data with pre-agreed accuracy benchmarks?

The 30-Day AI Pilot Design Template

The best way to evaluate AI capability is a structured proof of concept on your own data.

Pre-POC setup (Week 0):

  • Define the specific workflow the AI is meant to improve
  • Document the baseline (current state without AI; see measuring SaaS ROI 90 days after purchase)
  • Set pre-agreed success metrics: accuracy rate, time savings, human review rate
  • Confirm data requirements for the POC environment

Week 1-2: Controlled testing

  • Run the AI feature on a representative sample of your data
  • Measure accuracy against your pre-agreed definition
  • Document failure cases and review rate

Week 3: Edge case testing

  • Deliberately test with messy, incomplete, or edge-case inputs
  • Measure how accuracy degrades
  • Document whether the product surfaces low-confidence outputs appropriately

Week 4: Workflow integration

  • Test the AI feature in a simulated production workflow
  • Measure actual time savings (not estimated)
  • Get feedback from two or three team members who'd use it daily

POC success gate: If the AI feature meets your pre-agreed accuracy threshold and time savings target, you have evidence to support a purchase decision. If it doesn't, you have evidence to either renegotiate the scope or decline.
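The success gate can be computed mechanically from POC logs. The sketch below assumes a simple record format (each record notes whether the output met the pre-agreed correctness definition and whether a human had to revise it); the field names and the example thresholds are illustrative assumptions, not a vendor API.

```python
def poc_gate(records, accuracy_target=0.90, review_rate_ceiling=0.30):
    """Compute POC metrics and check them against pre-agreed thresholds.

    records: list of dicts with boolean "correct" and "human_revised" fields.
    """
    n = len(records)
    accuracy = sum(r["correct"] for r in records) / n
    review_rate = sum(r["human_revised"] for r in records) / n
    passed = accuracy >= accuracy_target and review_rate <= review_rate_ceiling
    return {"accuracy": accuracy, "review_rate": review_rate, "passed": passed}

# Hypothetical sample: 9 correct outputs, 1 wrong output that a human revised.
sample = [{"correct": True, "human_revised": False}] * 9 + \
         [{"correct": False, "human_revised": True}]
print(poc_gate(sample))  # accuracy 0.9, review rate 0.1 -> passed
```

Agreeing on the `accuracy_target` and `review_rate_ceiling` values with the vendor before Week 1 is the point: the gate becomes a pre-committed decision rule rather than a post-hoc negotiation.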

AI Marketing Terms Decoded

What the vendor says, and what it often means:

  • "AI-powered": At least one AI API call is in the product
  • "Machine learning driven": Rules-based system with some statistical component
  • "Proprietary AI": May be a fine-tuned version of a public model, not a built-from-scratch system
  • "Trained on billions of data points": Uses a foundation model trained on public data
  • "Industry-specific AI": Fine-tuned on some domain data; amount and quality unspecified
  • "Intelligent automation": Automation with some conditional logic
  • "AI assistant": Chatbot, often GPT-based with a custom prompt
  • "Predictive insights": Statistical forecasting, accuracy varies widely
  • "Real-time AI": API calls made during the user session, not pre-computed
  • "No-hallucination guarantee": Retrieval-augmented generation (RAG) system; reduces but doesn't eliminate hallucination
