DS Metrics: Shipped Models, Business Impact, Model Decay

You spent six weeks pushing AUC from 0.84 to 0.89. Your VP looks at the slide, nods, and asks, "OK, what did that buy us?" You don't have a number. The room goes quiet for the wrong reason.

This is the gap most data scientists fall into. We measure model accuracy. The CFO measures dollars. When those two columns don't reconcile on a QBR slide, the headcount review doesn't ask "what was your F1?" It asks "what did the DS team ship?" If you can't translate model work into business language, you get cut before the engineer who shipped a button.

So let's fix the metrics. Five of them. Each one is defensible in a room with a finance partner who has never opened a Jupyter notebook and doesn't plan to.

Why this matters now

Every DS team I've watched survive a budget cycle had the same trait: their lead could name dollar numbers. Not "we improved precision by 3 points." Not "we shipped 12 experiments." Dollars. Hours. Tickets deflected. Margin recovered.

The teams that got cut talked about model quality in isolation. They had beautiful confusion matrices and zero evidence that any decision in the company changed because of a model.

Headcount conversations in 2026 are sharper than they were three years ago. The cheap-money era taught DS teams to measure inputs (papers, experiments, AUC). The current era only counts outputs that show up in a P&L. If you came up under the old rules, you have to retrain yourself, fast. The metrics below are how you do it.

The 5 metrics that actually matter

1. Shipped models in production

The count of models serving real production traffic, attached to a real decision, owned by a real on-call.

Not notebooks. Not "deployed to staging." Not "ran a backfill once and emailed the results to ops." A model that is serving requests, has a runbook, and breaks something visible if it goes down.

Target: 2-4 shipped models per IC per year.

That number sounds low. It isn't. A shipped model means: data pipeline in prod, training pipeline in prod, serving stack in prod, monitoring in prod, downstream consumer wired up. Most DS overestimate how many they've actually done because they count notebooks. Count what's on-call. The number gets honest fast.
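If it helps to make the definition mechanical, here is a minimal sketch of the count. The field names and the three-model inventory are hypothetical, invented for illustration; the only thing taken from the definition above is that every box has to be ticked before a model counts.

```python
from dataclasses import dataclass, fields


@dataclass
class ModelStatus:
    """One row per model. Field names are illustrative, not a standard."""
    name: str
    data_pipeline_in_prod: bool
    training_pipeline_in_prod: bool
    serving_in_prod: bool
    monitoring_in_prod: bool
    consumer_wired_up: bool
    has_on_call_owner: bool

    def is_shipped(self) -> bool:
        # A model counts only when every boolean check is true.
        checks = [getattr(self, f.name) for f in fields(self) if f.type is bool]
        return all(checks)


def count_shipped(models):
    """Number of models that pass the full checklist."""
    return sum(1 for m in models if m.is_shipped())


# Hypothetical inventory, for illustration only.
inventory = [
    ModelStatus("lead_scoring", True, True, True, True, True, True),
    ModelStatus("churn_notebook", True, False, False, False, False, False),
    ModelStatus("fraud_in_staging", True, True, True, False, False, False),
]

shipped_count = count_shipped(inventory)
print(f"{shipped_count} of {len(inventory)} models are shipped")
```

Three models "built," one shipped. That gap is the honest number.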

If you shipped zero last year, that's the conversation. Why? Was it the platform? Was it scope? Was it a stakeholder who never integrated your output? Each answer points to a different fix and none of them are "I need a better model."

2. Business impact in dollars

Every shipped model gets a dollar number attached. Revenue lifted, cost saved, hours returned (multiplied by loaded hourly rate), churn prevented, fraud caught.

Target: each shipped model >= $250K annualized impact, or kill it.

The $250K floor is rough. Adjust for company size. A 30-person startup can defend $50K models if they're cheap to run; a public company shouldn't bother under $500K. The principle holds: every model has a number, and if the number is small, the model goes away or the headcount it consumes does.

How to actually compute it (not theoretically, on a slide):

  • Revenue model: lift in conversion rate × baseline traffic × AOV, annualized. Get finance to agree to the baseline before you ship. Pre-agreement is everything; post-hoc lift claims get challenged forever.
  • Cost model: tickets deflected × cost per ticket. Hours saved × loaded rate. Inventory write-down avoided. Get a number from finance for cost-per-ticket, and don't guess.
  • Risk model: fraud caught × average loss per case. Bad-debt avoided × write-off rate.

Whatever you compute, put the methodology in a footnote on the slide. "Lift measured against pre-launch baseline approved by FP&A on 2026-02-14." That sentence is worth more than the number itself, because it means the number won't get re-litigated next quarter.
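The three formulas above fit in a few lines of arithmetic. This is a rough sketch, not a finance-approved method: the function names, the monthly granularity, and every input number are assumptions made up for illustration. The only figure taken from the text is the $250K floor.

```python
MONTHS_PER_YEAR = 12
IMPACT_FLOOR_USD = 250_000  # the rough floor from this section; adjust for company size


def revenue_impact(conv_rate_lift, monthly_baseline_traffic, aov_usd):
    """Annualized revenue from an absolute conversion-rate lift (0.01 = 1 point)."""
    return conv_rate_lift * monthly_baseline_traffic * aov_usd * MONTHS_PER_YEAR


def cost_impact(monthly_tickets_deflected, cost_per_ticket_usd):
    """Annualized support cost avoided."""
    return monthly_tickets_deflected * cost_per_ticket_usd * MONTHS_PER_YEAR


def risk_impact(monthly_fraud_cases_caught, avg_loss_per_case_usd):
    """Annualized fraud loss avoided."""
    return monthly_fraud_cases_caught * avg_loss_per_case_usd * MONTHS_PER_YEAR


def verdict(annual_impact_usd, floor_usd=IMPACT_FLOOR_USD):
    """Apply the keep-or-kill floor to one model's annualized impact."""
    return "keep" if annual_impact_usd >= floor_usd else "kill or shrink"


# Hypothetical inputs. In practice every one of these comes from finance.
pricing_model = revenue_impact(0.01, 20_000, 150)
deflection_model = cost_impact(1_500, 12)

for name, impact in [("pricing", pricing_model), ("deflection", deflection_model)]:
    print(f"{name}: ${impact:,.0f}/yr -> {verdict(impact)}")
```

The code is the easy part. The inputs are where the argument happens, which is why the baseline and the cost-per-ticket figure need finance's name on them.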

3. Model decay rate

The percentage drop in your production metric versus your training-time metric, measured monthly.

Most models lose 5-20% of their headline metric in the first 90 days of production. Drift in input distributions, label leakage that didn't show up in offline eval, seasonality the training data didn't cover. Normal stuff. The danger isn't decay. It's silent decay.

Target: anything decaying more than 15% per quarter without a retrain plan is a liability. Either fix it or kill it.

A worked example. Suppose your fraud model trained at 0.91 AUC. After ship:

  • Month 1: 0.89 AUC in production. Drop = (0.91 - 0.89) / 0.91 = 2.2%. Within noise.
  • Month 2: 0.86. Drop = 5.5%. Watch it.
  • Month 3: 0.81. Drop = 11.0%. You have a problem; investigate.
  • Month 4: 0.76. Drop = 16.5% versus training, past the 15% threshold. Liability.
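The same arithmetic as a small script, so it can run monthly instead of living on a slide. A minimal sketch: the function and variable names are my own, and the single 15% threshold is the one stated above. Your alerting tiers may be finer-grained.

```python
LIABILITY_THRESHOLD_PCT = 15.0  # threshold from the target above


def decay_pct(training_metric, production_metric):
    """Percentage drop of the production metric relative to training time."""
    return (training_metric - production_metric) / training_metric * 100


TRAIN_AUC = 0.91
monthly_prod_auc = {1: 0.89, 2: 0.86, 3: 0.81, 4: 0.76}

report = {}
for month, auc in monthly_prod_auc.items():
    drop = decay_pct(TRAIN_AUC, auc)
    status = "liability" if drop > LIABILITY_THRESHOLD_PCT else "ok"
    report[month] = (round(drop, 1), status)
    print(f"Month {month}: AUC {auc:.2f}, drop {drop:.1f}% ({status})")
```

Swap AUC for whatever your headline metric is. The point is that the comparison is against the training-time number, every month, without anyone having to remember to look.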

If you don't have monitoring that can catch this in month 2 and a retrain pipeline to act on it, build both before you build any new model. A model that decays silently is worse than no model. It gives the business false confidence.

The one-line dashboard your VP wants on this: "X of N production models have drift alerts wired and a retrain SLA. Y of N do not." That ratio tells them how much surface area is actually under control.

4. Time-from-experiment-to-prod

Days between "the notebook works" (offline eval clears the bar) and "production traffic is hitting the model."

Target: under 45 days. 60 days is acceptable for a hard model. Above 90 days means the platform is broken, not you.
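Computing this takes two dates per model. The sketch below assumes you log the date offline eval cleared the bar and the date of first production traffic; the model names and dates are hypothetical. The text gives no label for the 61-90 day range, so "slow" there is my placeholder.

```python
from datetime import date
from statistics import median

# Hypothetical launch log: (model, offline eval cleared, first production traffic)
launches = [
    ("model_a", date(2026, 1, 5), date(2026, 2, 9)),
    ("model_b", date(2026, 1, 12), date(2026, 3, 13)),
    ("model_c", date(2026, 2, 2), date(2026, 6, 2)),
]


def classify(days):
    """Bucket a cycle time using the targets stated above."""
    if days < 45:
        return "on target"
    if days <= 60:
        return "acceptable for a hard model"
    if days <= 90:
        return "slow"  # placeholder label; the text leaves 61-90 days unnamed
    return "platform problem"


cycle_days = [(prod - cleared).days for _, cleared, prod in launches]
median_days = median(cycle_days)

for (name, _, _), days in zip(launches, cycle_days):
    print(f"{name}: {days} days ({classify(days)})")
print(f"Median experiment-to-prod: {median_days} days")
```

Report the median, not the mean. One 120-day outlier shouldn't hide two healthy launches, but it should still show up in the per-model list.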

This is the metric most data scientists won't put on a slide because it makes them look slow. Put it on the slide anyway. If your number is 120 days, that's a platform conversation, not a performance conversation. The fix is feature stores, training pipelines, model registries, and deploy automation, not "the data scientist needs to work harder."

When a VP sees this number and it's bad, they should be having an org-design conversation: do we need an ML platform engineer? Do we need to consolidate the deployment toolchain? Do we need to stop letting every team ship its own bespoke serving stack?

The first time I walked into a QBR and put cycle time on the slide, my VP's first reaction was defensive. By the end of the meeting, she'd written "ML platform Q2 priority" on the whiteboard. That number unlocked a hire.

5. Business partner NPS

A quarterly two-question survey to the PMs, ops leaders, and analysts who consume your models.

  1. On a 0-10 scale, how likely are you to recommend working with our DS team to a peer on another team?
  2. Why?

An NPS below 30 means you're solving the wrong problems, your communication is bad, your delivery is unreliable, or some combination. The free-text answer tells you which.

Target: NPS >= 50, with a hard floor of 30. Below 30 is a re-prioritization signal, not a "do better next quarter" signal.
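For anyone who hasn't computed NPS by hand: it's the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 through 6). A minimal sketch, with hypothetical survey responses and my own labels for the three bands:

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("need at least one response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


def nps_status(score, target=50, floor=30):
    """Compare a score against the target and hard floor stated above."""
    if score >= target:
        return "on target"
    if score >= floor:
        return "above floor, below target"
    return "re-prioritize"


# Hypothetical quarterly responses from ten business partners.
responses = [10, 9, 9, 8, 7, 10, 6, 9, 3, 10]

score = nps(responses)
print(f"NPS = {score:.0f} ({nps_status(score)})")
```

With ten respondents, one person changing their answer moves the score by 10 to 20 points. Treat small-sample swings as a prompt to read the free text, not as a trend.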

Why include this with hard metrics? Because the four metrics above are all lagging. By the time decay or shipped-model count tells you the story, two quarters have passed. Partner NPS leads. When the PM you support stops asking you to scope new work, you have six months before the dollar number goes flat. NPS catches it before that.

Run it. Send a Form, not an email. Anonymize the responses. Read the free-text. Adjust.

The "high accuracy, no impact" diagnostic

Here is the moment you'll find yourself in: a model with great offline metrics, deployed for two quarters, that nobody on the business side can point to as having changed anything. Run this checklist before your VP runs it on you.

4-question diagnostic (copy this into your QBR prep doc):

[ ] 1. Was the model output tied to a specific decision?
      (Not "informed strategy." A specific decision: discount yes/no,
       ticket priority high/low, lead routing to rep A or rep B.)

[ ] 2. Did that decision actually change because of the model?
      (Did anyone behave differently? Pull the before/after data.
       If decision rate is identical pre- and post-launch, the
       model is decoration.)

[ ] 3. Was the changed decision worth money?
      (Decisions can change without value. If reps started routing
       leads differently but conversion didn't move, that's $0.)

[ ] 4. Did finance agree with the methodology?
      (Get this in writing BEFORE the QBR. "FP&A approved the
       baseline on YYYY-MM-DD" is the magic sentence.)

If you answer "no" to any of the four, you don't have a business-impact metric. You have a story. Stories don't survive a CFO. Either fix the underlying gap or kill the model and free up the headcount.

The trap most teams fall into is question 1: they ship a propensity score and call the work done. A score isn't a decision. The score sitting in a database isn't worth anything. The decision rule that consumes the score and changes behavior is where the dollars come from. If that rule doesn't exist, the model is a hobby.

The QBR slide

One slide. Five rows. Last quarter, this quarter, delta. One model story with a dollar figure underneath.

Here's what mine looks like (numbers are illustrative, format is real):

Metric                        Q1 2026    Q2 2026    Delta
Models in production          7          9          +2
Annualized business impact    $2.1M      $3.4M      +$1.3M
Avg model decay (last 90d)    11%        8%         -3 pts
Median experiment-to-prod     52 days    38 days    -14 days
Business partner NPS          41         56         +15

Q2 highlight: Lead-scoring v2 (shipped April 14)

Routes inbound leads to reps based on conversion propensity. Replaced round-robin. Measured against pre-launch baseline (approved by FP&A 2026-03-22): conversion rate 4.1% → 5.6%. Annualized impact: $1.1M new revenue. Decay alarms wired; retrain SLA 30 days.

That's the whole slide. Five numbers. One model story. One footnote citing the FP&A baseline. No AUC anywhere on the page.

Could I have put AUC on it? Sure. The model's AUC is 0.87, up from 0.81 in v1. Nobody in that room cares. If they did, they'd ask, and I'd answer. They won't ask. They'll ask whether $1.1M is real, who signed off on the baseline, and what the on-call rotation is when it breaks.

That's the conversation a metric is supposed to start. AUC doesn't start that conversation. Dollars do.

Vanity-metric traps

Five metrics I see DS leads accidentally optimize for, that look productive and aren't.

Publication count. Papers are great for hiring senior DS into research orgs. They are not what your VP defends in a P&L review. If you're at an applied team and your top-line metric is publications, you're playing the wrong game. The CFO doesn't read NeurIPS.

Kaggle rank. Useful for personal brand. Useless for company impact. A senior DS with no Kaggle profile and four shipped revenue models beats a Kaggle Grandmaster with two notebooks every time on the question that matters: did the business get better.

Model AUC alone. AUC is a model-quality metric. Model quality is a means; business outcome is the end. AUC on a slide without dollars next to it makes the room think you're hiding something. Often you are, including from yourself.

Notebook count. I have seen DS resumes that list "ran 47 experiments." Forty-seven experiments and zero shipped models is a worse signal than four experiments and four shipped models. Ratio of ships to experiments is the real number.

"Models built." Watch this phrasing. "Built" is not "shipped." "Built and demoed to the team" is not "shipped." "Built and integrated into a dashboard PMs sometimes look at" is not "shipped." If a model isn't serving production traffic on a real decision, it is in a drawer. The number that goes on the slide is the number actually in production.

The pattern across all five: they measure work done, not value delivered. CFOs measure value delivered. So should you.

Putting it on your calendar

If you take one thing from this, make it a schedule:

  1. By Friday: count your shipped models (real definition) and write down the dollar number for each.
  2. By next QBR: get FP&A to approve a baseline for any model that doesn't have one. In writing.
  3. Every month: log the production-vs-training metric for each model. If decay > 15%, escalate.
  4. Every quarter: send the 2-question NPS survey. Read the free-text.
  5. Every QBR: bring the 5-row slide. Lead with dollars, not AUC.

The job isn't model quality. The job is shipped impact. AUC is a means; dollars are the end. If you can't name the dollar number for every model you've shipped, you don't have a metric. You have a hobby.
