
A Day in the Life of a Data Scientist

It's 8:07am. Your phone buzzed at 6:42 with a model decay alert from MLflow. The PM Slack-DMed you at 7:55: "quick q — why did this user get flagged churn?" You haven't opened your laptop yet. Welcome to the job that nobody describes accurately on LinkedIn.

The job description you signed up for promised "build ML models, drive insights, partner with stakeholders." All of that is true, technically. The proportions are not. On a typical Tuesday at a B2B SaaS company, model-building is maybe a quarter of your day. Feature engineering and SQL plumbing eat another big chunk. The rest is translation work: explaining to a VP why precision is not accuracy, defending a confidence interval to a sales leader, writing a Loom that nobody watches, and reminding the platform team that yes, the dbt model that breaks every Wednesday is still breaking every Wednesday.

This is not the senior-DS-on-LinkedIn version of the job. It's the actual one. If you're three months in and your week feels nothing like the role you interviewed for, that's not a bug in you. It's the role.

Here's what a real Tuesday looks like.

8:00 to 8:30am: The model performance check

Coffee in hand, laptop open, the first tab is always MLflow or Weights & Biases. You scan yesterday's predictions against the labels that came in overnight. The churn model AUC dropped from 0.84 to 0.79 over the weekend. That's not a catastrophe, but it's enough to investigate before standup.

You check the obvious things first. Did the input distribution shift? You pull up the feature drift dashboard. Two features are drifting. One is "days since last login," which makes sense because the product team shipped a new onboarding flow on Friday and a bunch of old users got nudged back in. The other is "support tickets in last 30 days," which doesn't have an obvious explanation. You make a note to dig in after standup.
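If you want to sanity-check what a drift dashboard is telling you, one common score is the Population Stability Index (PSI). The sketch below is an assumption about how such a check could work, not a description of what any particular dashboard computes. The function name, the bin count, and the equal-width binning are all illustrative choices.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], n_bins: int = 10
) -> float:
    """PSI between a baseline sample and a recent sample of one feature."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the baseline range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / n_bins or 1.0

    def bin_shares(values: Sequence[float]) -> list:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the baseline range into the edge bins.
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

You would call it once per feature, with last month's values as `expected` and the weekend's values as `actual`. Identical samples score 0, and the score grows as the two distributions pull apart. Where to set the alert threshold is a judgment call for your team.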

This thirty-minute check is one of the most underrated parts of the job. Half the senior DS folks I know got promoted on the strength of catching something here that nobody else would have caught. The model didn't break. The world the model was trained on broke, slightly, and you noticed before anyone else.

You also check yesterday's experiment runs in MLflow. Two of your teammates kicked off training jobs at 11pm. One converged. One blew up because somebody renamed a column in the source table without telling anyone. Welcome to data work.

9:30 to 11:00am: Experiment design (90 minutes of arguing about the metric)

The PM walks into the standup with a question: "Does the new pricing page convert better?" Easy, right? Sample size, two-proportion test, ship it.

It is never that easy.

You spend 90 minutes in Hex scoping the experiment. The math is the easy part. Here's what actually eats the time:

Defining "winning." The PM wants to measure clicks on the "Start free trial" button. You point out that clicks aren't the same as starts, starts aren't the same as activations, and activations aren't the same as paid conversions. By the time you're done, the success metric has changed three times and you have a guardrail metric for first-week retention because the old pricing page funnel had a known leak after activation.

Power calc reality check. With your current weekly traffic, detecting a 3% relative lift in paid conversions requires running the test for six weeks. The PM wants results in two. You agree to a noisier proxy metric (trial starts) for the early read and the real metric (paid conversions) for the call.
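For reference, the 15 minutes of math looks roughly like this. It is a minimal sketch of the standard two-sided, two-proportion sample size formula using only the standard library. The baseline rate and traffic numbers you plug in are yours to supply. Nothing here reproduces the six-week figure above, which depends on numbers the PM and you would have in front of you.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(
    baseline_rate: float, relative_lift: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Visitors needed in EACH arm to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def weeks_needed(n_per_arm: int, weekly_visitors: int) -> int:
    """Weeks of traffic required when visitors are split evenly across two arms."""
    return ceil(2 * n_per_arm / weekly_visitors)
```

The useful habit is running it twice: once for the metric you actually care about and once for the noisier proxy. Small relative lifts on low baseline rates need very large samples, which is usually where the "six weeks versus two" conversation starts.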

Guardrail metrics. What if the new page increases trial starts but tanks the quality of those trials? You add a guardrail on activation rate. What if it changes the mix of plan tier selections? You add a guardrail on average revenue per new customer. Now there are three metrics, two guardrails, and a stop rule.

Segmentation pre-commit. The PM asks if you can "just slice it by company size after." You say no, and then yes, but only with a Bonferroni correction and a written-down list of the segments before the test starts. You write the list together. This will save somebody's job later.
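The Bonferroni part is simple enough to write down next to the segment list. A minimal sketch, with hypothetical segment names and made-up p-values purely for illustration:

```python
def bonferroni_significant(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag which pre-registered segments clear the Bonferroni-adjusted threshold."""
    # Each segment is tested at alpha divided by the number of segments compared.
    threshold = alpha / len(p_values)
    return {segment: p <= threshold for segment, p in p_values.items()}


# Hypothetical segments, written down before the test starts.
segment_p_values = {"smb": 0.03, "mid_market": 0.01, "enterprise": 0.20}
results = bonferroni_significant(segment_p_values)
```

With three segments, each one has to clear 0.05 / 3 rather than 0.05, so a p-value of 0.03 that looks like a win on its own no longer counts. That is the whole point of agreeing on the list up front: the number of segments sets the bar.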

You publish the experiment design doc to the team's Hex workspace, link it in the project Notion page, and tag the PM and the eng lead. The math took 15 minutes. The conversation took 75. That ratio is roughly right for the rest of your career.

12:30 to 2:00pm: Async with eng on the production deployment

You wolf down lunch at your desk. The afternoon block is the one that frustrates new DS folks the most: your model has been "ready to ship" for three weeks, and the blocker has nothing to do with the model.

The model itself is fine. The issue is the feature pipeline. Your churn model needs a feature called weighted_engagement_30d, which is a weighted sum of logins, message sends, and meeting attendance over the last 30 days. That feature is computed in a dbt model called fct_user_engagement_daily. The dbt model breaks every Wednesday morning between 6 and 7am Pacific because it depends on a Salesforce sync that finishes inconsistently.
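For a sense of what a feature like this amounts to, here is a small Python sketch of a 30-day weighted sum. The real feature lives in a dbt model, so this is an illustration only. The weights, the row shape, and the column names are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical weights; the real ones are not stated anywhere in this post.
WEIGHTS = {"logins": 1.0, "message_sends": 0.5, "meetings_attended": 2.0}


def weighted_engagement_30d(daily_rows: list, as_of: date) -> float:
    """Weighted sum of daily activity counts over the 30 days ending at as_of."""
    window_start = as_of - timedelta(days=30)
    total = 0.0
    for row in daily_rows:
        # Keep only rows inside the trailing 30-day window (as_of inclusive).
        if window_start < row["day"] <= as_of:
            total += sum(weight * row[name] for name, weight in WEIGHTS.items())
    return total
```

The arithmetic is trivial. What makes the feature fragile is that every daily row has to have landed before the sum is computed.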

You don't own the dbt model. The analytics engineering team does. They know it's flaky. They have a ticket for it. They've had the ticket for two months.

So today's job is to write a long PR comment on the staging deployment. You explain:

  • The feature pipeline has a known weekly failure window
  • Your model needs the feature with at most a 24-hour staleness
  • The current monitoring on the dbt model fires after the feature is already stale
  • You'd like the platform team to either fix the upstream dbt model or wire in a fallback that uses the last good snapshot of the feature with a freshness flag
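The fallback option in the last bullet could look something like the sketch below. It assumes the serving layer can see both the latest pipeline output (or nothing, if the run failed) and the last good snapshot. `FeatureValue`, `resolve_feature`, and the tuple shapes are hypothetical names for illustration, not an existing API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple

# The model tolerates at most 24 hours of staleness.
MAX_STALENESS = timedelta(hours=24)


@dataclass
class FeatureValue:
    value: float
    computed_at: datetime
    is_fresh: bool  # the freshness flag downstream consumers can check


def resolve_feature(
    latest: Optional[Tuple[float, datetime]],
    last_good: Tuple[float, datetime],
    now: datetime,
) -> FeatureValue:
    """Use the latest run if it exists, else the last good snapshot, and flag staleness."""
    value, computed_at = latest if latest is not None else last_good
    return FeatureValue(value, computed_at, is_fresh=(now - computed_at) <= MAX_STALENESS)
```

The design choice worth arguing for is the flag. Serving a stale value silently is worse than failing, but serving a stale value that says it is stale lets the model owner decide what to do with it.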

You tag the platform team lead, link the dbt monitoring alert from last Wednesday as evidence, and link your model's freshness requirements doc. You don't escalate. You don't whine. You leave a calm, specific PR comment with the options ranked by your preference and a default recommendation. Then you close the tab.

This is the part of the job that the JD calls "cross-functional collaboration." It is mostly waiting, with intermittent calm advocacy, for a thing that you cannot fix to be fixed by someone else. Make peace with it early.

2:00 to 3:00pm: The "this prediction is wrong" meeting

A sales leader has booked thirty minutes on your calendar. The subject line: "Quick chat about the churn model — one of my customers it flagged just renewed for three years."

You know how this is going to go. You've had this meeting roughly fourteen times since you started. You have a script. Here's the rough shape:

Open with curiosity, not defense. "Tell me about the customer. What's their story?" You let the sales leader vent for five minutes about how the model is going to embarrass the team, how the customer would be insulted if they knew, and how they had a feeling the model was unreliable anyway.

Reframe the metric without making them feel dumb. "The model predicts at 87% precision on the 'high churn risk' cohort. That means out of every 100 accounts the model flags, about 87 actually churn within 90 days. The other 13 don't. This account is one of those 13. That's not the model failing. That's the model performing exactly as designed."

Bring a Looker dashboard, not a defensive tone. You share your screen and pull up the churn model performance Looker. You show the precision-recall curve. You show that lowering the threshold to catch more churners would mean even more false positives, and raising it would miss real churn. You show the dollar value of churn caught last quarter ($2.4M) versus the dollar value of false positives if every flagged account walked away (much higher than that, but they don't).
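If you want to rehearse the threshold trade-off before the meeting, a few lines of Python are enough. This is a minimal sketch on toy data: the scores and labels below are made up, and the function name is illustrative.

```python
def precision_recall_at(scores: list, labels: list, threshold: float) -> tuple:
    """Precision and recall when every account scoring at or above threshold is flagged."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(flagged)
    total_positives = sum(labels)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Toy data: model scores and whether each account actually churned (1) or not (0).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

strict = precision_recall_at(scores, labels, threshold=0.75)
loose = precision_recall_at(scores, labels, threshold=0.50)
```

On this toy data the strict threshold flags only real churners but misses one of them, while the loose threshold catches every churner and picks up a false positive along the way. That is the same trade the sales leader is looking at on the precision-recall curve, just small enough to check by hand.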

Close with what the model is for, in their language. "This model is a triage tool. It's not a verdict. It tells your team where to spend the first hour of their week. The fact that one flagged account renewed means your team did their job. They intervened. That's the win condition."

The sales leader leaves the call calmer than they came in. You add the conversation to the experiment log. You add a slide to the next monthly DS-to-Sales sync called "Precision is not Accuracy" because if you've had this conversation fourteen times, the rest of the sales org has had it more often than they're admitting.

4:00 to 5:30pm: Notebook cleanup and the experiment log

The last block of the day is back in Jupyter. You open the analysis notebook from this morning's pricing-page experiment scoping. It's a mess. Half of it is sketch code. There's a cell that just says df.head() with no comment. There's a chart with no axis labels. There's a SQL query against the wrong schema.

You clean it up. The discipline here matters more than the prettiness. You're not making it pretty for yourself. You're making it readable for the version of you in three months who needs to remember why you picked sample size 4,200 and not 5,000. The convention on your team:

  • Top-of-notebook markdown cell with the question, the date, and the conclusion
  • Each section has a one-line markdown header above the code
  • Charts have titles, axis labels, and a one-sentence interpretation below
  • Final cell is a markdown cell with the recommendation and the next action

You commit the notebook to the team repo. You write up a 200-word summary in the team's experiment log, which lives in a shared Notion page. You also push the dbt model change you wrote this morning. It's in a separate branch and tagged for review tomorrow.

Last thing before you close the laptop: you check the sales leader meeting notes from 2pm and add a TODO to the team backlog. "Build a sales-facing Looker dashboard that shows the churn model's precision and recall in plain English, refreshed weekly." This is the third time you've thought about building it. Maybe this week you actually will.

What the JD won't tell you

A few things the JD will gloss over that are worth knowing on day one.

You'll write more SQL than Python. Snowflake or BigQuery will be your second language. The cleaner your SQL, the faster your iteration loop. Most of the bottlenecks in your week are not modeling. They're "the data isn't in the right shape yet."

Feature engineering eats 40% of your modeling time. The XGBoost model takes 20 minutes to train. The features that go into it took two weeks to build, validate, and wire into a pipeline. New DS folks underestimate this every time.

The hardest "ML problem" is usually a stakeholder who doesn't trust the output. You can have a beautifully calibrated model with a publication-worthy AUC and zero impact, because nobody will act on it. Trust is built through repeatable explanations, dashboards in stakeholders' language, and showing up to the "this prediction is wrong" meeting calmly.

"This prediction is wrong" panic is a weekly event. Have a script ready. Bring a dashboard. Don't get defensive. The conversation is the job, not a distraction from the job.

Production deploys are slower than you think. Not because the code is hard. Because the data dependencies are flaky, the staging environment is a mirror of last quarter's production, and the platform team is also doing their best.

Tools you'll actually use

The stack varies by company, but the shape is consistent. At most B2B SaaS shops with a real DS function in 2026:

  • Python and Jupyter for analysis and modeling. Some teams have moved heavier work into VS Code, with notebooks stored as percent-format .py files, but Jupyter is still the default for exploration.
  • SQL on Snowflake or BigQuery for everything that touches production data. If you came from a place that used Postgres directly, the warehouse mental model is different, and you'll think about cost per query for the first time.
  • dbt for transformations. You'll read more dbt models than you write, but you'll write some.
  • MLflow or Weights & Biases for experiment tracking. Most teams have one or the other; it almost doesn't matter which.
  • Hex for collaborative notebooks that PMs and analysts can read. Hex is doing for analytics what Figma did for design.
  • Looker for stakeholder-facing dashboards. The dashboards you build here are what your stakeholders judge you by, even if the model behind them is the actual work.
  • Optional but common: Airflow, Prefect, or Dagster for orchestration. You may or may not own these.

You will not use all of these on every team. You will probably learn a new one in your first 90 days.

What "good" looks like after 90 days

If you're new in the role, here's a useful definition of success at the three-month mark. Forget the model count. Look for these:

  1. You can name your top three stakeholders' real questions, in their language, without a doc in front of you.
  2. You have one model in production. Even a small one. Even a logistic regression on a single feature. Production beats notebook.
  3. You've stopped flinching when a stakeholder pings you with "this prediction is wrong." You have a script. You bring a dashboard.
  4. You know which dbt models break weekly and which features are flaky. You've stopped being surprised by the Wednesday breakage.
  5. You've written at least one document that another DS on the team has linked to. Knowledge that compounds beats analysis that gets forgotten.

That's the bar. It's not "shipped five models" or "doubled team velocity." Those metrics don't survive contact with reality. The list above does.

The 50/50 split between technical work and translation work is not a flaw in the role. It's the role. Translation isn't beneath the modeling. It's how the modeling actually changes anything. The DS folks who lean into that earlier than their peers are the ones who get promoted to Senior in 18 months instead of 30.

The model decay alert at 6:42am will buzz again tomorrow. Eat breakfast first.
