
AI in the Data Scientist Workflow: What to Automate, What to Never Touch

The first time Copilot hallucinated a column name on me, the model trained anyway. The pull request joined customer_id against a column called customer_uuid, which was not the matching key in the right-hand table. Pandas did what pandas does when join keys never match: the left join silently produced NaNs for every row, threw no error, and "succeeded." The downstream model fit fine. The validation AUC looked normal. I caught it three days later only because a stakeholder asked why a specific cohort had vanished from the output.

Nobody warns you about that failure mode. The marketing for AI-assisted data science is full of demos where the model writes a flawless pd.merge against a clean toy dataset. The actual failure mode is silent. Code runs, results look plausible, and the bug surfaces after you've already presented the chart.
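One way to make this kind of join fail loudly is to wrap the merge in a small guard. The sketch below is illustrative, not a standard recipe: checked_left_merge, the max_unmatched threshold, and the toy frames are hypothetical names, while validate and indicator are real pd.merge parameters. It checks that both key columns exist and that the share of unmatched left rows stays under a limit.

```python
import pandas as pd

def checked_left_merge(left, right, left_on, right_on, max_unmatched=0.0):
    """Left-join that raises on missing key columns or too many unmatched rows."""
    # Fail fast if either key column is absent (guards against invented names).
    if left_on not in left.columns:
        raise KeyError(f"{left_on!r} is not a column of the left frame")
    if right_on not in right.columns:
        raise KeyError(f"{right_on!r} is not a column of the right frame")

    merged = left.merge(
        right,
        how="left",
        left_on=left_on,
        right_on=right_on,
        validate="many_to_one",  # raises if right-hand keys are duplicated
        indicator=True,          # adds a "_merge" column: both / left_only
    )
    # Share of left rows that found no partner on the right.
    unmatched = (merged["_merge"] == "left_only").mean()
    if unmatched > max_unmatched:
        raise ValueError(
            f"{unmatched:.0%} of rows found no match on {left_on!r} -> {right_on!r}"
        )
    return merged.drop(columns="_merge")

# Toy data where the keys never match, mimicking the wrong-key join.
orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
customers = pd.DataFrame({"customer_uuid": [7, 8, 9], "segment": ["a", "b", "c"]})

# The plain merge runs without complaint and fills segment with NaN.
naive = orders.merge(
    customers, how="left", left_on="customer_id", right_on="customer_uuid"
)
print(naive["segment"].isna().all())
```

Calling checked_left_merge(orders, customers, "customer_id", "customer_uuid") on the same frames raises a ValueError instead of returning a frame full of NaNs. The right threshold depends on the data; zero is only a strict starting point.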

So here's the line I've drawn after about eighteen months of using these tools daily, written for working data scientists who are tired of both the hype and the reflexive rejection. Both extremes are wrong. There's a middle that ships faster without your model quality regressing, and this guide tries to describe it concretely.

Why this matters now (and why most "AI in DS" content is confused)

There are two completely different things people mean when they say "AI in data science," and conflating them produces incoherent advice.

The first is AI as a workflow accelerator for the existing data science job: boilerplate code, EDA scripts, docstrings, slide drafts. This is what your IDE Copilot, Cursor, and Claude do. The job is still building models and explaining them. AI is a faster typewriter.

The second is building LLM-powered applications: RAG systems, agent pipelines, evaluation harnesses for generative outputs. This is a different job. The skills overlap (you still need stats, you still need to think about evaluation), but the failure modes, the toolchain, and the day-to-day work are different from training an XGBoost model.

When leadership says "add AI to your workflow," they usually mean the first. When leadership says "build an AI feature," they usually mean the second. If you don't separate these, you'll waste a quarter trying to apply RAG patterns to a churn prediction problem, or worse, you'll ship a chatbot using XGBoost intuition and wonder why your evals don't work.

The rest of this guide is mostly about the first. There's a section near the end on the second.

Where AI genuinely helps (use it daily)

These are the places I let AI write first drafts every day, with light review:

Boilerplate code. Pandas reshapes I've written a hundred times: pivot, melt, groupby chains. Sklearn Pipeline and ColumnTransformer scaffolding. Matplotlib subplot grids. The mechanical stuff where the structure is fixed and only the column names vary. Cursor or Copilot will get this right 90% of the time, and the 10% it gets wrong is fast to spot because you already know what the output should look like.
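For a sense of what "structure is fixed, only the column names vary" looks like, here is a minimal sketch of the Pipeline and ColumnTransformer scaffold. The column names, the toy data, and the choice of LogisticRegression are placeholders for illustration, not recommendations.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; these lists are the part that changes per project.
NUMERIC_COLS = ["tenure_months", "monthly_spend"]
CATEGORICAL_COLS = ["plan"]

def build_pipeline():
    """Preprocessing plus model in one object, so fit and predict stay in sync."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC_COLS),
        ("cat", categorical, CATEGORICAL_COLS),
    ])
    return Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])

# Toy data with a couple of missing numeric values.
df = pd.DataFrame({
    "tenure_months": [1, 24, None, 36, 3, 48],
    "monthly_spend": [20.0, 55.0, 30.0, None, 25.0, 80.0],
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro"],
    "churned": [1, 0, 1, 0, 1, 0],
})
features = df[NUMERIC_COLS + CATEGORICAL_COLS]
pipeline = build_pipeline()
pipeline.fit(features, df["churned"])
predictions = pipeline.predict(features)
```

This is the kind of code that is quick to review: if a generated version puts a column in the wrong list or drops the imputer, it shows immediately.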

EDA scripts. First-pass distribution plots, null counts, correlation heatmaps, value counts on every categorical. The "give me the shape of this dataset" pass. AI is good at this because the patterns are formulaic and the output is visual, so you'll see if something looks off. I still write the second-pass EDA myself, because that's where the actual thinking happens.

Docstrings and type hints. When you already know what the function does and you just need it documented. Highlight the function, prompt "write a numpy-style docstring with examples," and review. Saves twenty minutes per module on a real codebase.

Slide drafts and stakeholder summaries. First drafts only. I write a paragraph of bullet points about what the model does and what the result was, then ask Claude to turn it into three slides for a non-ML audience. Then I rewrite about 60% of it. The first draft is the boring part: structure, transitions, repetition for emphasis. The rewrite is where I add the parts that actually matter.

Literature review summaries. I drop in five paper abstracts and ask "which of these are relevant to predicting customer churn from event-stream data, and what's the core method in each?" The output is a triage list. I then read the papers I actually need to read. This is summarization, not interpretation, and it works because it's verifiable. If the summary says paper 3 uses transformers, I can check.

The pattern across all of these: AI is good at the parts where you already know what "right" looks like, so you can spot the mistakes. It's a typing accelerator, not a thinking substitute.

Where AI breaks (never delegate)

These are the parts I will not let AI touch, and I'll explain why for each:

Causal inference. Confounders, selection bias, identification strategy, the question of whether a regression coefficient means anything causal. LLMs will happily write a propensity score model for a question that needs a difference-in-differences design. They don't know your data-generating process. They don't know which variables are post-treatment. They don't know that your "control group" is selected on the outcome. A confidently wrong causal claim is worse than no claim, and AI is very good at being confidently wrong about this.

Modeling decisions. Which algorithm to use, which loss function, which validation scheme, how to handle leakage. These are judgment calls that depend on the business context, the data shape, and what the model is for. Copilot tends to suggest random forests for everything, plausibly because random forests show up so often in its training data. It doesn't know that your problem has temporal leakage that breaks every cross-validation scheme except a forward-chaining one. You have to make these calls yourself.
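For reference, a forward-chaining scheme is one where every training window ends before its validation window begins. A minimal sketch using scikit-learn's TimeSeriesSplit, assuming the rows are already sorted by time (the twelve-row array is toy data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations already sorted by time (index 0 is the oldest).
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for train_idx, test_idx in folds:
    # The training window always ends before the validation window starts,
    # so no fold trains on rows from its own future.
    assert train_idx.max() < test_idx.min()
    print(f"train 0..{train_idx.max()}  validate {test_idx.min()}..{test_idx.max()}")
```

Whether this is the right scheme is exactly the judgment call described above. The snippet only shows the mechanics, and it does nothing about leakage that lives inside the features themselves.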

Feature interpretation. What a SHAP value means in this business context. AI can generate the SHAP plot. It can describe what SHAP values are in the abstract. It cannot tell you whether "tenure has a high SHAP value" means tenure is causally important or just that tenure proxies for something else you're not measuring. That requires knowing the business.

Business framing. Translating "the model says churn probability is 0.73 for this segment" into "we should change the renewal cadence for accounts over $50k." That's a decision-making translation, and getting it wrong is how data science loses credibility. The LLM doesn't know what your company's risk tolerance is. It doesn't know which stakeholders are skeptical of data work and need extra evidence. It doesn't know that the last time you proposed a similar intervention it failed.

The shorthand: AI is fine for the what. Never use it for the why or the so what.

The tooling stack that actually works

After trying most of them, this is the stack I land on for a working data scientist in 2026:

Cursor for code. It's VS Code with better LLM integration. The Composer feature (where you describe a multi-file change and let it propose edits) is genuinely useful for refactoring a feature pipeline across three files. I keep the autocomplete on for boilerplate and turn it off when I'm thinking about logic. The mode switch matters.

Claude (or equivalent) for code review. Before I open a PR, I paste the diff into Claude with a prompt like: "Review this for correctness, not style. Focus on: column references, off-by-one errors, leakage, deprecated APIs, and edge cases on null handling." It catches things. Not always, but often enough that I keep doing it. It's a second pair of eyes that's available at 11pm before a deadline.

Notebook-native AI (Hex Magic, Deepnote AI) on a short leash. These are great for the EDA pass: "show me the distribution of every numeric column" or "find correlations above 0.7." I do not let them write the final analysis. The leash is the rule that anything they generate gets re-run in a clean notebook before it leaves my laptop, and I read every line of generated SQL. The convenience is real, the trust is bounded.

The reason a set of tools beats a single tool: each one has different blind spots. Cursor is good at local context (the file you're in) but bad at understanding what your data actually looks like. Claude is good at higher-level reasoning ("does this make sense?") but doesn't have your IDE context. Notebook tools are good at quick data peeks but tend to write throwaway code. You want different tools for different jobs, not one tool trying to do all three badly.

The "LLM wrote the analysis" trap

This is the failure I see most often in junior and mid-level DS, and increasingly in senior DS who should know better.

The pattern: you finish the modeling, you have a results table, you paste it into ChatGPT, and you ask "summarize the key findings." It writes a confident, articulate, well-structured narrative. You lightly edit it, paste it into the report, and ship.

The problem is that the LLM is pattern-matching to what data science conclusions usually sound like, not to what your data actually shows. It will say things like "the model demonstrates strong predictive performance, with feature importance suggesting that customer engagement is the primary driver." That sentence is structurally correct and may be entirely false. The model might be performing well only on a leaky feature. "Customer engagement" might be high-importance only because it's a near-duplicate of the target.
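A cheap check for the near-duplicate case is to score each feature on its own against the target. The sketch below is one possible heuristic, not a complete leakage audit: suspicious_features, the 0.95 threshold, and the toy columns are all hypothetical, and it only looks at numeric columns.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def suspicious_features(df, target, threshold=0.95):
    """Return numeric features whose single-column AUC is suspiciously high."""
    flagged = {}
    for col in df.columns.drop(target):
        if not pd.api.types.is_numeric_dtype(df[col]):
            continue
        # Use the raw feature as the score; fill gaps so the metric can run.
        auc = roc_auc_score(df[target], df[col].fillna(df[col].median()))
        # An AUC near 0 is just as suspicious as one near 1 (reversed ranking).
        strength = max(auc, 1 - auc)
        if strength >= threshold:
            flagged[col] = round(strength, 3)
    return flagged

# Toy data: engagement_score separates the classes perfectly, tenure does not.
df = pd.DataFrame({
    "engagement_score": [0.9, 0.8, 0.95, 0.1, 0.2, 0.05, 0.85, 0.15],
    "tenure_months": [3, 40, 12, 5, 30, 18, 24, 9],
    "churned": [0, 0, 0, 1, 1, 1, 0, 1],
})
flagged = suspicious_features(df, "churned")
print(flagged)
```

A flagged feature is not proof of leakage. It is a prompt to go find out how that column is computed and whether it would be available at prediction time.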

This is the modern equivalent of p-hacking. P-hacking was about finding a story that fit the data through enough searching. The LLM analysis trap is about getting a story written to the data without checking whether it's true. The story is plausible, the prose is clean, and the underlying claim is wrong.

How to tell when you've fallen in: if you can't, line by line, point at the specific number in the results that supports each sentence in the summary, you're in the trap. The fix is to write the analysis yourself, then ask the LLM to edit it for clarity. Editing a draft you wrote is fundamentally different from generating a draft from numbers, even if the final word count is the same.

AI for ML vs building LLM apps

A quick clarification because this confuses team conversations constantly.

A data scientist using Copilot to build a churn model is doing classical ML with an AI-assisted IDE. The model is XGBoost or a neural net. The evaluation is AUC, calibration, business impact. The deployment is a batch scoring job or a real-time API. The failure modes are leakage, drift, and miscalibration.

A data scientist building a RAG system or an LLM agent is doing something different. The "model" is a foundation model you didn't train. The evaluation is qualitative or LLM-judge-based, not AUC. The deployment is a service with prompt templates, retrieval indexes, and guardrails. The failure modes are hallucination, prompt injection, and cost runaway.

Both are legitimate work. Both can be on a data scientist's plate. But they are not the same skill, and a senior DS who's great at the first might be mediocre at the second until they put in the reps. When leadership says "add AI to the product," ask them which one they mean. If they don't know, that's the first conversation, not a coding task.

Optional ACE Framework tagging

If your team uses the ACE Framework (Ingest, Analyze, Predict, Generate, Execute), most classical DS work sits in Analyze and Predict. Building LLM apps sits in Generate. This isn't just vocabulary. It's a way to push back when scope creeps. When a PM asks "can you add a generative AI feature to the churn model," you can say: "the churn model is a Predict capability; what you're describing is a Generate capability, which is a different system with different evaluation. Let's scope them separately." The framework gives you words for the boundary you already know exists.

The 30-day adoption plan

Here's the plan I'd run if I were starting over, or onboarding a junior DS into AI-assisted work:

Week 1: Boilerplate and docstrings only. Install Cursor and set up a Claude account. Use them only for code completion on mechanical tasks (pandas reshapes, sklearn pipelines) and for writing docstrings on functions you've already written. Keep a running note (just a text file) every time the suggestion was wrong. By end of week, you should have ~20 examples of failure modes specific to your codebase. This is calibration data. It tells you when to trust the tool and when to ignore it.

Week 2: Add EDA assistance. Pick one finished project where you already know what the EDA should have shown. Re-run the EDA pass using AI assistance and compare it to your original work. Note specifically what AI missed (it often misses the contextual stuff, like "this variable looks normal but is actually a leak from the future") and what it caught faster than you would have. By end of week, you should have a written rule for when AI EDA is useful and when it's not.

Week 3: Code review loop. For every PR you open, paste the diff into Claude first with a code review prompt. Log: how many PRs got useful comments from Claude? How many bugs did Claude catch that your team's reviewers would have missed? How many false positives? After a week you'll have a sense of whether this loop is worth keeping. For me, it was, but the calibration is per-team.

Week 4: Write your team's "where we use AI / where we don't" doc. One page. List the tasks where AI is the default tool. List the tasks where AI is banned. List the tasks where AI is allowed but every output gets human review before merging. Get sign-off from your manager. The point of writing it down is that it forces the conversation, which surfaces disagreements you didn't know existed.

Common pitfalls

A short list, ordered by how often I've seen them blow up in production:

  1. Trusting hallucinated column names. Always check that the columns referenced in AI-generated code exist in the dataframe. df.columns.tolist() is your friend.
  2. Accepting a model recommendation without a second opinion. If Copilot says "use a random forest here," ask yourself why, and ask Claude separately to critique that choice. Disagreement is information.
  3. Letting AI write the executive summary. Already covered. Don't.
  4. Using LLMs for causal claims. They will give you a confident answer. How confident it sounds tells you nothing about whether it is true.
  5. Forgetting that the LLM only sees the rows you paste. If you paste a 50-row sample and ask "is there an outlier pattern," the LLM sees those 50 rows and nothing else. It will not know about the long tail, so its advice can be wrong for the full data.
  6. Shipping the same prompt across projects without re-tuning. Your codebase has conventions. Generic code-review prompts don't catch your team-specific patterns.
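The column check in the first pitfall takes a few lines. A minimal sketch, where missing_columns and the referenced list are hypothetical names and the list stands in for whatever columns a generated snippet mentions:

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    present = set(df.columns)
    return [name for name in required if name not in present]

df = pd.DataFrame({"customer_id": [1, 2], "plan": ["basic", "pro"]})

# Hypothetical list of columns that a generated snippet refers to.
referenced = ["customer_id", "customer_uuid", "plan"]

absent = missing_columns(df, referenced)
if absent:
    print(f"Generated code references columns that do not exist: {absent}")
```

Running this before the generated code, rather than after, is the point: it turns a possible silent failure into an explicit message.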

Templates and tools

A working starter kit:

  • Cursor rules file for DS work. Tells Cursor about your team's conventions: which version of sklearn you're on, that you use Polars not pandas (or vice versa), that all features need a leakage comment.
  • Claude code-review prompt template. "Review this diff for: column reference correctness, leakage, deprecated APIs, edge cases on nulls, and consistency with the rest of the codebase. Do not comment on style."
  • AI usage policy one-pager. A literal one-page Google doc your team signs off on. Three columns: task, AI allowed?, review required? Hang it in your team channel.
  • EDA verification checklist. When AI generates an EDA, run through: did it count nulls correctly? Did it catch the categorical with 10,000 unique values? Did it notice the date column with timezone issues? If it missed any of these, the rest of its output is suspect.
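The EDA verification checklist can be backed by a few lines you run yourself, so there is something to compare the generated EDA against. This is a sketch under assumptions: eda_ground_truth and the max_categories cutoff are hypothetical, and "timezone issues" is narrowed here to datetime columns that carry no timezone at all.

```python
import pandas as pd

def eda_ground_truth(df, max_categories=1000):
    """Compute the facts an AI-generated EDA should also have reported."""
    report = {
        "null_counts": df.isna().sum().to_dict(),
        "high_cardinality": [],
        "tz_naive_datetimes": [],
    }
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_datetime64_any_dtype(series):
            # Datetimes without a timezone are a common source of silent shifts.
            if series.dt.tz is None:
                report["tz_naive_datetimes"].append(col)
        elif not pd.api.types.is_numeric_dtype(series):
            # Non-numeric column with too many distinct values to plot or encode.
            if series.nunique() > max_categories:
                report["high_cardinality"].append(col)
    return report

days = ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]
df = pd.DataFrame({
    "session_id": [f"s{i}" for i in range(5)],
    "plan": ["basic", None, "pro", "pro", "basic"],
    "signup_at": pd.to_datetime(days),            # no timezone attached
    "seen_at": pd.to_datetime(days, utc=True),    # timezone-aware
})

# A tiny cutoff so the toy data triggers the high-cardinality flag.
report = eda_ground_truth(df, max_categories=3)
print(report)
```

If the generated EDA disagrees with this report on any of the three items, treat the rest of its output as suspect, as the checklist says.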

Measuring whether this is working

Three signals, in order of importance:

  1. Time to first model on a new dataset drops measurably. If a junior DS used to take three days to get to a baseline model and now takes one, that's the win you're looking for. If the time hasn't dropped, you're not actually using AI for the parts where it helps.
  2. PR review comments about "wrong column" or "deprecated API" go to zero. These are the easy bugs. If they're still showing up, the code review loop in Week 3 isn't catching them.
  3. The team has a written policy and refers to it. Not just a doc that exists, but a doc that gets cited in PRs and design reviews. "We don't use AI for this because of section 3 of the policy" is the marker that the boundary is real.

The negative signal that matters more than all of these: someone on the team ships an LLM-written analysis as their own work. If that ever happens (and it will, eventually), you don't have an AI policy problem, you have a credibility problem, and the fix is a conversation, not a tool change.

The line between "AI helps me ship faster" and "AI shipped a wrong analysis under my name" is thinner than the marketing suggests. Draw it explicitly, write it down, and revisit it every quarter as the tools change.
