Moving Models to Production Without Breaking Them

The notebook scored 0.91 AUC on your laptop. The model is now live, returning -infinity for 3% of requests, the on-call engineer is paged at 2am, and you can't reproduce the original training run because you never pinned the random seed. Welcome to the gap between "model works" and "model ships."

I've watched this scene play out at six companies, with five different ML stacks, and the pattern doesn't change. The losses aren't algorithmic. They're operational. Training/serving skew quietly drops production AUC by 4-7 points. A feature pipeline that worked yesterday emits NaNs today because someone changed an upstream schema and didn't tell you. Eng treats your model as a black box because you treated their service as one.

This is the playbook I wish someone had handed me before my first prod incident. It covers the seven things that decide whether your model becomes a quiet revenue line or a recurring postmortem topic.

Why this matters now

In the teams I've worked with, 30-50% of model value gets lost between offline AUC and online business impact. That's not a metric I made up. Run the numbers on your last three shipped models: offline lift, online lift, time from "merged" to "fully ramped." The gap is almost always staffing-shaped, not algorithm-shaped.

The DS who owns rollback discipline, monitoring, and the eng interface ships 5x more value than the DS with a marginally better model architecture. Production readiness is a discipline, not a tool choice. You can do it on AWS Batch and a Postgres table. You can also fail spectacularly with the most expensive feature store on the market.

Feature pipeline: Feast vs Tecton vs DIY

Every prod ML failure I've personally caused traces back to features. Specifically: the features the model trained on were not the features the model saw at scoring time. We call this training/serving skew, and it's the silent killer.

Three options:

DIY (Postgres table or warehouse view). Fine when you have one model, batch scoring, and the same SQL produces training data and serving data. Most companies should start here and stay here longer than they think. The trap: when you start adding realtime features, the SQL you wrote for training (a 7-day rolling sum from the warehouse) is not the SQL the serving service runs (a streaming aggregate from Redis). They drift. Silently.

Feast (open source). Free, you host it. Earns its keep when you have 3+ models sharing features, want the same definition used in training and online serving, and are willing to operate Redis/DynamoDB and a Spark or Flink pipeline. The honest tax: you'll spend a quarter onboarding before features start flowing. Worth it if you're past the one-model phase.

Tecton (managed). Buy it when feature engineering is a real bottleneck, you have a budget for $80-200K/year, and you'd rather not run streaming infra. Tecton solves training/serving skew by making one definition produce both backfills and online values. Their lineage tracking catches "this feature changed last Tuesday" before you do.

The decision is not "which tool is best." It's three questions: do I have many models sharing features, do I need realtime features, and is feature drift currently invisible to me? If yes-yes-yes, you need a store. If no-no-no, a warehouse view and a feature_definitions.py file you import in both training and serving is more than enough.
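Here's a minimal sketch of what that shared file could look like. The feature name spend_7d, the Event shape, and build_features are all hypothetical; the point is only that training and serving call the same function and differ in nothing but the as_of timestamp they pass.

```python
# feature_definitions.py (hypothetical module, imported by training AND serving)
from datetime import datetime, timedelta
from typing import Iterable, Tuple

Event = Tuple[datetime, float]  # (event timestamp, amount); an assumed shape


def spend_7d(events: Iterable[Event], as_of: datetime) -> float:
    """Sum of amounts in the 7 days up to and including `as_of`.

    Training passes the historical prediction timestamp as `as_of`;
    serving passes the request time. Both paths run this exact code.
    """
    window_start = as_of - timedelta(days=7)
    return sum(amount for ts, amount in events if window_start < ts <= as_of)


def build_features(events: Iterable[Event], as_of: datetime) -> dict:
    """Single entry point, so the feature set cannot diverge between paths."""
    events = list(events)
    return {"spend_7d": spend_7d(events, as_of)}
```

If the serving path ever needs a faster implementation (a streaming aggregate, say), this function becomes the reference you test the fast path against.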

Training pipeline reproducibility

If I asked you to rerun the exact training job that produced the model in prod right now, from raw data, with the same splits and the same metrics, could you do it in 30 minutes?

If the answer is no, you don't have a reproducible pipeline. You have a souvenir.

What reproducibility actually requires:

  1. Pin every seed. Not just random_state=42 on the train/test split. Seed numpy, seed PyTorch/TensorFlow, seed your sampler, seed your data shuffler, seed any augmentation. I've debugged a model where two engineers got 0.86 and 0.91 AUC from "the same notebook" because torch's CUDA RNG was unseeded. Three days lost.

  2. Hash your splits. Don't trust train_test_split to be deterministic across library versions or row orderings. Compute a stable hash of a row identifier (user_id, transaction_id), take it modulo a fixed bucket count such as 100, and send the buckets below your test percentage to test. Same row, same split, forever. Bonus: when you retrain on new data, old test-set rows stay in test.

  3. Record the dataset SHA in the model artifact. The model card should include: SHA-256 of the training dataset (post-feature-engineering), training window (2025-10-01 to 2026-03-31), feature schema version, code commit, library lockfile hash, eval metrics on held-out, eval metrics on a frozen reference set. This goes in the same Git LFS or MLflow artifact as the model weights.

  4. Lock the runtime. A requirements.lock or poetry.lock or uv.lock committed alongside the model. Library version drift breaks reproducibility quietly. A minor-version bump such as scikit-learn 1.3 to 1.4 can be enough to shift predictions.
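The hashed split from item 2 can be sketched in a few lines. This assumes string row identifiers; the salt value, the 100-bucket count, and the 20% test share are illustrative choices, not requirements. It uses hashlib rather than Python's built-in hash(), because the built-in is randomized per process for strings.

```python
import hashlib


def split_bucket(row_id: str, salt: str = "split-v1", n_buckets: int = 100) -> int:
    """Map a stable row identifier to a bucket in [0, n_buckets)."""
    digest = hashlib.sha256(f"{salt}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def assign_split(row_id: str, test_pct: int = 20) -> str:
    """Rows whose bucket falls below `test_pct` go to test; the rest go to train."""
    return "test" if split_bucket(row_id) < test_pct else "train"
```

Changing the salt reshuffles every row, so treat it like a schema version: set it once and record it with the model.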

The "rerun training from 6 weeks ago" test is non-negotiable. If you can't pass it, you can't credibly debug a regression. You're guessing.
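To pass that test you need the record from item 3 written down at training time. A rough sketch, assuming the post-feature-engineering dataset and the lockfile are each a single file on disk; the field names and the build_model_card helper are my own invention, not a standard format.

```python
import hashlib
import json


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_model_card(dataset_path: str, lockfile_path: str, code_commit: str,
                     schema_version: str, window: tuple, seed: int,
                     metrics: dict) -> str:
    """Return the model card as JSON, ready to store next to the weights."""
    card = {
        "dataset_sha256": sha256_of_file(dataset_path),
        "lockfile_sha256": sha256_of_file(lockfile_path),
        "code_commit": code_commit,
        "feature_schema_version": schema_version,
        "training_window": {"start": window[0], "end": window[1]},
        "seed": seed,
        # Expected to hold both the held-out and the frozen-reference metrics.
        "eval_metrics": metrics,
    }
    return json.dumps(card, indent=2, sort_keys=True)
```

Write the resulting JSON into the same artifact as the weights, so the two can never be separated.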

Batch vs realtime serving: when to use each

Most "realtime" models should be batch. I'll say it twice because it gets ignored every time. Most "realtime" models should be batch.

The decision tree:

  • Latency budget > 1 hour, prediction is slow-changing → batch nightly. Score everyone at 2am, write to a table, the product reads from the table. Lead scoring, churn risk, content recommendations for non-cold-start users. p99 latency: a SELECT.
  • Latency budget 5-60 minutes, prediction needs hourly freshness → batch hourly. Same shape, more frequent. Inventory forecasting, demand signals.
  • Latency budget 30s-5 min, depends on session activity → microbatch. Streaming consumer scores in batches of 100-1000 records every 30 seconds. Fraud signals, anomaly detection where action can wait a minute.
  • Latency budget < 200ms, request is unpredictable → realtime. Ad ranking, search relevance, fraud blocks at checkout, real-time personalization. This is the expensive one. p99 latency budget should be set in the interface contract before you train, not after you deploy.

The cost difference is enormous. A nightly batch job runs on a single big instance for an hour. A realtime service needs autoscaling, warm pools, p99 monitoring, and an SRE on rotation. Pick batch unless you have a real reason.

A war story: we shipped a "realtime" recommendation model that called Postgres synchronously for three feature lookups per request. p99 went to 4 seconds before we caught it. The fix wasn't faster Postgres. It was admitting the model didn't need to be realtime, moving it to batch, and serving from a precomputed table. Latency dropped to 8ms. The product team didn't notice the change because the recommendations weren't actually session-dependent.

Model monitoring: drift, concept shift, business metric

Four things to monitor. Three of them are diagnostic. Only one pays the bills.

Input drift. Are the features the model sees today distributed like the features it trained on? Track Population Stability Index (PSI) per feature, daily. PSI > 0.1 = investigate. PSI > 0.25 = your training distribution and your prod distribution are no longer the same. Alert. For continuous features, also run a Kolmogorov-Smirnov test against a reference window. Cheap, fast, catches schema breaks before predictions go bad.
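For reference, PSI over bins is the sum of (actual share - expected share) * ln(actual share / expected share). A minimal pure-Python sketch, assuming non-empty samples of one continuous feature; the quantile binning on the reference sample, the 10 bins, and the eps floor for empty bins are common choices I'm assuming here, not the only valid ones.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a reference sample and a current sample of one feature."""
    ref = sorted(expected)
    # Interior cut points at the 1/n, 2/n, ... quantiles of the reference sample.
    cuts = [ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)]

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            # Bin index = number of cut points strictly below the value.
            counts[sum(v > c for c in cuts)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Run it per feature per day against a frozen training-time sample, and apply the 0.1 and 0.25 thresholds above to the result.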

Prediction drift. Are the model's outputs distributed like they were last week? Sometimes input drift is invisible (feature interactions move) but prediction drift is loud. Track p10/p50/p90 of model output daily.

Concept drift (label-delay aware). Has the relationship between features and the label changed? You can only check this when labels arrive, which for many models is days or weeks later. Build a delayed evaluation pipeline: when labels land, recompute AUC/MAE on those rows and chart it over time. The trap is alerting on AUC the day after deploy when you have no labels yet. You'll be staring at 0.
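A sketch of the label-delay-aware part, assuming binary labels and predictions keyed by a shared id. The min_labeled threshold of 200 is an arbitrary placeholder. The pairwise AUC is quadratic in the number of rows, which is fine for illustration and too slow for large evaluation sets.

```python
from typing import Dict, Optional, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def delayed_auc(predictions: Dict[str, float], labels: Dict[str, int],
                min_labeled: int = 200) -> Optional[float]:
    """AUC over rows whose labels have landed; None until there are enough of them."""
    ids = [i for i in predictions if i in labels]
    ys = [labels[i] for i in ids]
    if len(ids) < min_labeled or len(set(ys)) < 2:
        return None  # not enough evidence yet: report "no data", never 0
    return auc([predictions[i] for i in ids], ys)
```

Returning None instead of a number is the design choice that matters: the alerting layer can skip a missing value, but it will happily page someone over a 0.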

The business metric. Revenue. Conversion rate. CAC. Lifetime value. The thing the model exists to move. This is the only metric that decides whether the model stays on. Alert thresholds on input/prediction drift are diagnostic. Alert thresholds on the business metric are existential.

I've seen teams ship a model with beautifully calibrated drift dashboards and zero visibility into whether revenue moved. Don't be that team. The first dashboard is the business one. The drift dashboards exist to explain why the business metric moved, not to substitute for it.

The "shadow mode" rollout

Score-but-don't-act. For one to two weeks. On the same traffic. Compare the new model's predictions against the incumbent's predictions and against actual outcomes when labels arrive.

This is the single highest-leverage practice I know in production ML, and most teams skip it.

The shadow rollout flow:

  1. Deploy the new model alongside the incumbent. Both score every request. Only the incumbent's predictions affect product behavior.
  2. Log both predictions plus all features used.
  3. After 7-14 days, compare:
    • Distribution overlap: are the two models making similar predictions on the same inputs? If 30% disagree, find out why before flipping.
    • Match against offline expectations: does the new model's online behavior match what your held-out evaluation predicted? If offline said +5% lift and shadow says they're identical, suspect leaky training data or training/serving skew before you suspect anything else.
    • Latency, error rate, edge cases (NaN inputs, missing features): does the new model handle the long tail?
  4. Only flip the switch when shadow data confirms the offline story.
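The distribution-overlap check in step 3 can start as something this small. It assumes both models emit a score in [0, 1] and that the product acts on a single decision threshold; ShadowRecord and the 0.5 default are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShadowRecord:
    request_id: str
    champion_score: float    # acted on by the product
    challenger_score: float  # logged only


def disagreement_rate(records: List[ShadowRecord], threshold: float = 0.5) -> float:
    """Share of requests where the two models land on opposite sides of the threshold."""
    if not records:
        raise ValueError("no shadow records to compare")
    flips = sum(
        (r.champion_score >= threshold) != (r.challenger_score >= threshold)
        for r in records
    )
    return flips / len(records)


def mean_abs_gap(records: List[ShadowRecord]) -> float:
    """Average absolute score difference, a rough measure of how far apart the models sit."""
    if not records:
        raise ValueError("no shadow records to compare")
    return sum(abs(r.champion_score - r.challenger_score) for r in records) / len(records)
```

Neither number says which model is right. They tell you where to look once the labels arrive.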

Champion/challenger is the same idea, formalized. The incumbent is the champion. The new model is the challenger. You don't promote the challenger until it has won under real traffic, not just on a test set.

The rollout itself should be staged: 1% → 5% → 25% → 50% → 100%, with at least 24-48 hours between steps and the business metric monitored at each stage. If anything moves the wrong way, pause. Don't rationalize.
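One way to implement the staged ramp is deterministic bucketing, sketched below. It assumes you ramp by a stable unit such as user id; the salt and function names are illustrative. Because a unit's bucket never changes, anyone who saw the challenger at 1% keeps seeing it at 5%, 25%, and beyond.

```python
import hashlib

ROLLOUT_STAGES = [1, 5, 25, 50, 100]  # percent of traffic, as in the text


def rollout_bucket(unit_id: str, salt: str = "pricing-model-v23") -> int:
    """Stable bucket in [0, 100) for a user or other ramp unit."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def serves_challenger(unit_id: str, stage_pct: int) -> bool:
    """True if this unit gets the challenger at the current stage percentage."""
    return rollout_bucket(unit_id) < stage_pct
```

Pausing is just leaving stage_pct where it is. Rolling back is setting it to 0.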

Eng partnership patterns: interface contracts

The "throw the pickle over the wall" model fails every time. Here's what works instead.

Before you start training, sit down with the platform engineer and write a one-page interface contract. It covers:

  • API schema. Input fields, types, allowed nulls, validation rules. Output fields, types, ranges. What does the response look like when the model can't score (missing features, model server down)?
  • Latency SLO. p50, p95, p99 in milliseconds. This determines whether you can do realtime feature lookups, what model architectures are off the table, whether GPU inference is required.
  • Throughput SLO. Requests per second at peak. Drives autoscaling configuration.
  • Error budget. What percentage of requests is the service allowed to fail? 0.1%? 1%? This sets the bar for input validation strictness.
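  • Ownership boundary. Who gets paged when the model returns wrong predictions, when the service returns 500s, and when nobody can tell yet which it is? DS owns the first. Eng owns the second. Both own the third (the service is up, predictions look bad, and the cause could be the model or the pipeline feeding it).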
  • Rollback authority. Who can flip the kill switch without escalation? At 3am, the answer should be "the on-call engineer," not "page Camellia."
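The API schema bullet is the easiest one to turn into code. Below is a sketch of input validation plus an explicit "can't score" response, so a bad input or a non-finite score becomes a labeled fallback instead of a -infinity in production. The field names, INPUT_SCHEMA, and ScoreResponse are hypothetical; a real contract would define its own.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

# Hypothetical input schema: field name -> (accepted types, nullable)
INPUT_SCHEMA = {
    "user_id": ((str,), False),
    "cart_value": ((int, float), False),
    "days_since_signup": ((int,), True),
}


@dataclass(frozen=True)
class ScoreResponse:
    score: Optional[float]  # None when the model could not score
    model_version: str
    status: str             # "ok" or "fallback"
    reason: Optional[str] = None


def validate(payload: Dict[str, Any]) -> Optional[str]:
    """Return None if the payload satisfies the schema, else a short reason."""
    for name, (types, nullable) in INPUT_SCHEMA.items():
        value = payload.get(name)
        if value is None:
            if not nullable:
                return f"missing required field: {name}"
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            return f"bad type for {name}"
    return None


def score_request(payload: Dict[str, Any], model: Callable[[dict], float],
                  model_version: str = "pricing-model-v23") -> ScoreResponse:
    reason = validate(payload)
    if reason is not None:
        return ScoreResponse(None, model_version, "fallback", reason)
    score = float(model(payload))
    if not math.isfinite(score):
        return ScoreResponse(None, model_version, "fallback", "non-finite score")
    return ScoreResponse(score, model_version, "ok")
```

What the product does with a fallback response (default price, previous score, no recommendation) is exactly the kind of decision the contract meeting exists to settle.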

Write this down. Sign it. Pin it. When the inevitable production incident happens, the contract is what keeps the conversation about fixing the issue instead of negotiating responsibility.

The DS who shows up to that meeting with a draft contract earns trust the moment they walk in. The DS who shows up with a notebook and asks "how do we deploy this?" starts at zero.

Rollback discipline

Every deploy ships with a tested rollback path. Not a documented rollback path. A tested one.

Three components:

  1. Version tagging. Every model gets a version (pricing-model-v23). Every prediction logged includes the version that scored it. When you investigate a regression, you can filter by version.
  2. The kill switch. A configuration flag that swaps traffic from the new model back to the previous one in under 60 seconds. Not a redeploy. A flag flip. The serving service reads the flag at request time.
  3. Rollback exercises. Every quarter, exercise the rollback in production for 5 minutes during a low-traffic window. If you haven't pulled the trigger in 90 days, the rollback is theoretical, which means it doesn't work.
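Components 1 and 2 fit together in one small router, sketched here. The read_flag callable stands in for whatever config service or feature-flag store you already run; ModelRouter and its arguments are hypothetical names.

```python
from typing import Callable, Dict

Model = Callable[[dict], float]


class ModelRouter:
    """Chooses the serving model per request from a flag lookup."""

    def __init__(self, models: Dict[str, Model],
                 read_flag: Callable[[], str], fallback_version: str):
        if fallback_version not in models:
            raise ValueError("fallback model must be loaded")
        self.models = models        # both versions stay loaded in memory
        self.read_flag = read_flag  # hypothetical: reads the active version flag
        self.fallback_version = fallback_version

    def predict(self, features: dict) -> dict:
        # Read the flag on every request, so a flip needs no redeploy.
        version = self.read_flag()
        if version not in self.models:
            version = self.fallback_version
        # Log this dict as-is: every prediction carries the version that scored it.
        return {"score": self.models[version](features), "model_version": version}
```

The previous model has to stay loaded for this to work, which is also what stops its artifact from quietly disappearing.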

The first time you need a rollback is not the time to discover that the previous model's artifact has been deleted from S3, the feature pipeline has been refactored, or the kill switch never worked. I have personally lived all three.

Common pitfalls

  • Training on data the model can't see at serving time. Leakage. The most common form: a feature that's only computable after the event you're predicting. A "days since last login" feature computed from the warehouse will include logins from after the prediction timestamp unless you explicitly cut it off.
  • Pinning random_state=42 but not seeding the upstream sampler. Your splits are deterministic, your training is not.
  • Deploying without champion/challenger. You're guessing whether the new model is better.
  • Alerting on AUC instead of revenue. AUC went from 0.84 to 0.82. Is that bad? You don't know. Maybe revenue went up.
  • No rollback test in 90 days. You don't have a rollback. You have a wish.
  • Treating eng as a service desk instead of a partner. They'll treat your model the same way.
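The leakage pitfall at the top of that list has a mechanical fix: every feature function takes the prediction timestamp and refuses to look past it. A minimal sketch, with the function and its arguments as illustrative names:

```python
from datetime import datetime
from typing import Iterable, Optional


def days_since_last_login(logins: Iterable[datetime], as_of: datetime) -> Optional[int]:
    """Days between `as_of` and the latest login at or before it.

    Logins after `as_of` are ignored: they were not knowable at prediction time.
    Returns None when the user had no login yet.
    """
    seen = [ts for ts in logins if ts <= as_of]
    if not seen:
        return None
    return (as_of - max(seen)).days
```

A version that takes max() over all logins looks identical on today's data and leaks the future into every historical training row.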

Templates and tools

The four documents every shipped model should have:

  • Model card. Dataset SHA, training window, seed, feature schema version, eval metrics on held-out and frozen reference, serving SLO, owner, deploy date, rollback procedure.
  • Shadow-mode comparison checklist. Distribution overlap, offline-online match, latency, error rate, edge case handling, sign-off requirements per stakeholder.
  • Rollback runbook. Step-by-step kill switch flip, how to verify the previous model is serving traffic, who to notify, postmortem template.
  • Eng/DS interface contract. One page. API schema, SLOs, error budget, ownership boundary, rollback authority. Signed by DS, eng lead, and the on-call rotation.

Measuring success

You're doing this right when:

  • You can rerun any prod model's exact training run from raw data in under 30 minutes.
  • Every deploy has a tested rollback that's been exercised in the last 90 days.
  • Monitoring catches drift before the business metric moves, not after.
  • Your platform engineer says "working with you is easy" without prompting.
  • The last three deploys were boring.

Boring deploys are the goal. Drama is failure. The DS who ships boring is the DS who ships often, and the DS who ships often compounds value faster than anyone with a marginally better model architecture.
