
Data Scientist Tools and Tech Stack: The Honest 2026 Build Guide

Let me describe the stack most data science teams actually have, even at companies with eight-figure ARR.

A Jupyter notebook on someone's laptop. A CSV in S3 with a name like customers_FINAL_v3_use_this.csv. A model.pkl that someone sent to a backend engineer over Slack three quarters ago. A Looker dashboard nobody trusts because the joins keep silently changing. A Confluence page titled "ML Architecture" that was last edited the day before the previous head of data quit.

If this is your stack, you're not behind. Most teams are here. The honest question isn't whether your setup is embarrassing. It's whether you stay here, or whether you do the boring work to get out.

This guide is for the IC data scientist (not the VP, not the platform team) who needs to either build a stack from scratch or audit the duct-taped mess they inherited. We're going to walk the Core 6 layers, name the open-source defaults, name the paid upgrades, and say plainly when each is worth it. If you want to hire for this role properly, the Data Scientist JD template is the companion piece.

Why this matters now

Models in notebooks don't make money. The gap between "I trained a model with 0.87 AUC" and "the business uses this prediction every day to make a decision" is roughly 80% tooling and 20% science. Nobody likes hearing that, especially the DS who spent three years on a stats PhD, but it's true.

The data scientist who can stand up the full stack from warehouse to monitoring is the one who gets promoted, gets headcount, gets budget, and stops being treated like a SQL-running cost center. The one who can't is the one who keeps shipping notebooks and wondering why the next layoff has their name on it.

You don't have to love MLOps. You do have to be conversant in it.

The Core 6 — what every ML stack actually needs

Six layers. Real prices. When each is worth it.

| Layer | Open-source default | Paid upgrade | Real cost | When to upgrade |
|---|---|---|---|---|
| Warehouse | Postgres, DuckDB | Snowflake, BigQuery, Databricks SQL | Snowflake $2-4/credit (5-figure/mo at scale), BigQuery $6.25/TB scanned | Any team training off prod data weekly |
| Notebook / IDE | Jupyter, VS Code + Jupyter | Hex, Deepnote, Databricks Notebooks | Hex $40-$80/user/mo, Deepnote slightly cheaper | Team of 3+ DS doing collaborative work |
| Experiment tracking | MLflow self-hosted | Weights & Biases, Neptune.ai, Databricks ML | W&B $20-$100/user/mo, MLflow self-host ~$50/mo VM | More than 20 experiments/week or compliance |
| Feature store | Feast | Tecton, Databricks Feature Store | Tecton 6-figure starting price | 50+ models in prod, real reuse across teams |
| Model serving | BentoML, Ray Serve | SageMaker, Vertex AI, Modal | SageMaker $0.05-$2/hr per endpoint, Modal pay-per-second | Spiky traffic (Modal) or platform team exists (SageMaker) |
| Monitoring & drift | Evidently | Arize, WhyLabs | Arize $1k-$10k/mo, WhyLabs free tier exists | Any model with revenue or compliance impact |

Let's go through each one, because the table is the cheat sheet, not the argument.

Warehouse / data layer

Snowflake, BigQuery, or Databricks SQL. Pick the one your data engineering team already pays for. If there is no data engineering team and you're choosing fresh, BigQuery is the cheapest to start (pay-per-query at $6.25/TB scanned, no idle warehouse cost) and Snowflake is the easiest to share with non-technical analysts.
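To make the pay-per-query number concrete, here is a rough back-of-the-envelope sketch. The $6.25/TB rate is the figure quoted above; the workload sizes, the run count, and the decimal-TB conversion are assumptions for illustration, so check current pricing before you budget off it.

```python
# Rough on-demand scan cost, using the $6.25/TB figure quoted in this guide.
PRICE_PER_TB_USD = 6.25  # verify against current pricing before budgeting


def scan_cost_usd(gb_scanned: float, runs_per_month: int = 1) -> float:
    """Estimated monthly cost of scanning `gb_scanned` GB on every run."""
    tb_per_run = gb_scanned / 1000  # assumption: decimal TB (1 TB = 1000 GB)
    return round(tb_per_run * PRICE_PER_TB_USD * runs_per_month, 2)


# Hypothetical workloads: a daily job that scans 200 GB of raw data
# versus the same job reading a 20 GB curated table.
raw_monthly = scan_cost_usd(200, runs_per_month=30)
curated_monthly = scan_cost_usd(20, runs_per_month=30)
print(f"raw: ${raw_monthly}/mo, curated: ${curated_monthly}/mo")
```

Either number is small next to a data scientist's hourly cost, which is the point: the scan bill is rarely the expensive part.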

The mistake I see weekly: a DS team trying to "save money" by training models directly off raw Parquet in S3 with no warehouse layer, so every job re-reads 200GB and rewrites schemas in pandas. That's not saving money. That's burning DS hours, which cost ten times more than the warehouse credits you avoided. Buy the warehouse. Use dbt to transform inside it. Train off curated tables.

Notebook / IDE

Jupyter is free, local, fine for solo work. For teams of three or more, the collaborative notebooks (Hex at $40-$80/user/mo, Deepnote slightly cheaper) earn their keep because they put SQL, Python, and a publishable artifact on one canvas. Stakeholders can read a Hex doc; they can't read your analysis_v7_final.ipynb.

Databricks Notebooks are bundled with Databricks compute. If you already pay for the compute, the notebooks are fine. If you don't, you're paying Databricks platform pricing for what's essentially a hosted Jupyter, and that math doesn't work.

Underrated option: VS Code plus the Jupyter extension. It's free and fast, with real git integration, a debugger, and extensions. Most senior data scientists I respect use it for serious work and reserve hosted notebooks for exploration and stakeholder sharing.

Experiment tracking

This is the layer where most teams have three tools because nobody decided. Pick one.

MLflow is open-source and self-hostable on a small VM for around $50/mo. The tracking UI is fine. The model registry is functional. You'll spend maybe one engineering day setting it up and a few hours per quarter maintaining it.

Weights & Biases is the prettiest UI in the category, the easiest to share with stakeholders, and worth paying for (between $20 and $100 per user per month depending on tier) if you run more than twenty experiments a week or if your team genuinely uses the comparison tooling. If two of you run three experiments a quarter, MLflow is fine and W&B is overkill.

Neptune.ai is the cheaper W&B alternative with most of the same features. Worth a look if W&B's pricing scares you.

Whatever you pick, kill the others. The worst experiment-tracking stack is the one where Alice uses W&B, Bob uses MLflow, and the new hire opens TensorBoard because that's what they had at their last job.

Feature store

Feast is open-source and free in dollars. It's not free in hours. You have to host Redis (or another online store), set up the registry, write the materialization jobs, and keep it all running. For a team of two with three models in prod, Feast is theoretical infrastructure and a well-organized dbt project does the same job with a tenth of the maintenance.

Tecton is the enterprise paid option. The starting price is in six figures. It's only justifiable if you have 50+ models in production with real feature reuse across teams. A two-person team buying Tecton is the loudest possible signal of bad capital allocation in this field.

Databricks Feature Store is bundled if you're already on Databricks. Use it if you are. Don't switch platforms to get it.

Honest take: most teams under ten models in prod don't need a feature store yet. They need clean feature pipelines in dbt and a naming convention. Skip the feature store layer until the pain of duplicating features across five training jobs becomes louder than the pain of standing up Feast.
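A naming convention only helps if something enforces it. Here is a minimal sketch of a check you could run in CI against the column names of your feature tables. The `entity__feature__window` pattern and the function names are hypothetical, chosen for illustration; swap in whatever convention your team agrees on.

```python
import re
from typing import Dict, List

# Hypothetical convention: <entity>__<feature>__<window>
# e.g. customer__order_count__30d
SNAKE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
WINDOW = re.compile(r"^(\d+[dhw]|all)$")  # 30d, 12h, 4w, or "all"


def check_feature_name(name: str) -> List[str]:
    """Return a list of problems with a feature name (empty means it passes)."""
    parts = name.split("__")
    if len(parts) != 3:
        return [f"expected entity__feature__window, got {len(parts)} part(s)"]
    entity, feature, window = parts
    problems = []
    if not SNAKE.match(entity):
        problems.append(f"entity '{entity}' is not snake_case")
    if not SNAKE.match(feature):
        problems.append(f"feature '{feature}' is not snake_case")
    if not WINDOW.match(window):
        problems.append(f"window '{window}' should look like 30d, 12h, 4w, or all")
    return problems


def find_violations(columns: List[str]) -> Dict[str, List[str]]:
    """Map each non-conforming column name to its problems."""
    report = {}
    for col in columns:
        problems = check_feature_name(col)
        if problems:
            report[col] = problems
    return report


violations = find_violations([
    "customer__order_count__30d",
    "Total_Orders",
    "customer__order_count__last_month",
])
for col, problems in violations.items():
    print(col, "->", "; ".join(problems))
```

Failing the build on a non-empty report is the cheap version of what a feature store's registry gives you.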

Model serving

The serving layer is where most stacks are over-engineered. Four real options:

SageMaker is AWS-native, complex, and runs about $0.05 to $2 per hour per endpoint depending on instance. It's the right answer if you already use AWS heavily and have a platform engineer to manage endpoints. It's the wrong answer if you're a two-person DS team and you just want a model behind an HTTP endpoint.

Vertex AI is the GCP equivalent. Similar pricing, similar complexity, similar caveats.

Modal is serverless GPU. You pay per second of compute. It's excellent for spiky inference (recommendations on a low-traffic site, batch scoring jobs, anything where you'd otherwise pay for an idle endpoint). The developer experience is the best in the category. It's my default recommendation for indie and small-team setups.

BentoML is an open-source framework. You write your inference logic, BentoML packages it, and you deploy the package on Kubernetes (or Modal, or Lambda, or wherever). Pair it with Modal and you have a serverless GPU stack at startup prices.

The Modal plus BentoML combo is what I'd build today if I were starting a DS team from scratch with no platform team. SageMaker is what you commit to when you have a platform team and a procurement contract that already includes AWS credits.
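The "spiky traffic" argument is just arithmetic, so here is a sketch of the break-even calculation. The $1.00/hr endpoint sits inside the SageMaker range quoted above; the per-second rate is a made-up placeholder, not Modal's actual price, so plug in real numbers from the pricing pages before deciding.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def always_on_monthly(hourly_rate: float) -> float:
    """Cost of an endpoint that bills every hour, busy or idle."""
    return hourly_rate * HOURS_PER_MONTH


def per_second_monthly(rate_per_second: float, busy_seconds: float) -> float:
    """Cost of serverless compute that bills only while it is working."""
    return rate_per_second * busy_seconds


def break_even_utilization(hourly_rate: float, rate_per_second: float) -> float:
    """Fraction of the month you must be busy before always-on gets cheaper."""
    return hourly_rate / (rate_per_second * 3600)


ENDPOINT_HOURLY = 1.00       # within the $0.05-$2/hr range quoted above
SERVERLESS_PER_SEC = 0.0006  # hypothetical placeholder rate

busy_seconds = 2 * 3600 * 30  # assumption: about two busy hours a day
print("always-on:  ", always_on_monthly(ENDPOINT_HOURLY))
print("per-second: ", round(per_second_monthly(SERVERLESS_PER_SEC, busy_seconds), 2))
print("break-even: ", round(break_even_utilization(ENDPOINT_HOURLY, SERVERLESS_PER_SEC), 2))
```

Under these assumed rates the per-second option wins until the endpoint is busy a bit under half the time. If your real utilization is well above your break-even, the always-on endpoint is the cheaper choice.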

Monitoring & drift detection

If you have models in production and no monitoring, you don't have models in production. You have time bombs scored by AUC.

Evidently is open-source, runnable as a Python library or a standalone service. It's the right starting point. You can wire it into a notebook and have basic drift reports running in an afternoon.

WhyLabs has a free tier that scales up. Solid choice if you want a hosted dashboard without the budget for Arize.

Arize is the serious paid option, $1k-$10k/mo for production volume. It's worth paying for once you have more than five models in prod or any regulatory requirement (financial services, healthcare, anything with auditors).

Start with Evidently free. Upgrade when the number of models in prod or the compliance pressure justifies it. Don't buy Arize before you have a model that needs monitoring.
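If you want to see what a drift check computes before installing anything, here is a minimal sketch of the Population Stability Index for one numeric feature, in plain Python. This is not Evidently's API, just one common drift statistic written out by hand. The bin count, the clamping of out-of-range values, and the 0.2 alert threshold are assumptions (0.2 is a widely quoted rule of thumb, not a law).

```python
import math
import random
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `reference`.

    Bins are equal-width over the reference range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0)
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Synthetic data for illustration: training-time values vs. shifted production values.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

ALERT_THRESHOLD = 0.2  # rule-of-thumb cutoff, tune per feature
score = psi(reference, shifted)
print("drift alert" if score > ALERT_THRESHOLD else "ok", round(score, 3))
```

Run something like this per feature on a schedule and post the result to a channel, and you have the bare minimum. A real tool adds the parts you don't want to maintain: categorical features, reports, history.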

The source-of-truth question (where most DS stacks rot)

Garbage labels in, garbage models out. You already know this. What you might not have internalized is where most label garbage comes from: the operational system of record. The CRM. The ticketing tool. The product analytics setup that three different PMs configured three different ways across two reorgs.

If your "customer churned" label comes from a CRM where Sales rep A marks a deal "Closed Lost - No Decision," rep B marks the same situation "Lost - Competitor," and rep C just deletes the deal, no amount of MLflow tracking saves you. Your churn model is learning your reps' inconsistent data hygiene, not customer behavior.

A clean operational system of record matters more than a fancy feature store. It's not glamorous. It doesn't get you a conference talk. But the data scientist who spends a week fixing pipeline-stage definitions and forcing required-field validation in the CRM ships better models for the next two years than the one who switches feature stores three times.
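A quick way to measure how bad the problem is: map the free-text close reasons to a small canonical set and count what falls through. A minimal sketch, assuming a hypothetical export of close-reason strings; the mapping and label names are illustrative, not a standard.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical mapping from what reps type to the labels the model should see.
CANONICAL = {
    "closed lost - no decision": "lost_no_decision",
    "lost - competitor": "lost_competitor",
}


def normalize_reason(raw: Optional[str]) -> str:
    """Map one raw CRM close reason to a canonical label."""
    if raw is None or not raw.strip():
        return "missing"
    return CANONICAL.get(raw.strip().lower(), "unmapped")


def label_audit(reasons: Iterable[Optional[str]]) -> Counter:
    """Count canonical labels so you can see how much is missing or unmapped."""
    return Counter(normalize_reason(r) for r in reasons)


# Illustrative rows, not real data.
audit = label_audit([
    "Closed Lost - No Decision",
    "lost - competitor ",
    None,
    "Churned??",
])
print(audit)
```

A large "unmapped" or "missing" share is the argument for required-field validation in the CRM. Note that this only sees rows that still exist; deals a rep deleted outright need the CRM's event log to recover.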

Rework CRM at $12/user/month gives you structured pipeline stages, custom fields with validation, an event log you can stream to your warehouse, and a single source of truth for the customer lifecycle that your churn and conversion models depend on. Whatever CRM you use, the principle holds: the upstream data quality decides the downstream model quality. Fix it before you tune another hyperparameter.

Build vs. buy — the actual decision tree

Here's the matrix. Find your row, build accordingly. Don't skip levels.

| Team size | Models in prod | Recommended stack | Total monthly cost |
|---|---|---|---|
| 1-3 DS | <5 | Jupyter + MLflow self-hosted + Evidently + Modal + dbt + your existing warehouse | $200-$500 |
| 4-10 DS | 5-20 | Hex + W&B + SageMaker or Vertex + Arize starter + dbt + Snowflake or BigQuery | $3k-$8k |
| 10+ DS | 20+, regulated | Databricks (or full enterprise stack) + Tecton + Arize full + SOC2 audit trail + dedicated platform team | $20k+ |
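If it helps to make the row lookup mechanical, here is a small sketch of the matrix as a function. Two things are my assumptions, not part of the matrix: values that sit on a shared boundary (exactly 10 DS, exactly 20 models) fall into the lower row, and when team size and model count point at different rows, the function picks the lower one, on the theory that buying ahead of your models is the more expensive mistake.

```python
from typing import Tuple

# (row label, recommended stack, total monthly cost), copied from the matrix.
TIERS = [
    ("1-3 DS, <5 models",
     "Jupyter + MLflow self-hosted + Evidently + Modal + dbt + existing warehouse",
     "$200-$500"),
    ("4-10 DS, 5-20 models",
     "Hex + W&B + SageMaker or Vertex + Arize starter + dbt + Snowflake or BigQuery",
     "$3k-$8k"),
    ("10+ DS, 20+ models, regulated",
     "Databricks (or full enterprise stack) + Tecton + Arize full + SOC2 audit trail",
     "$20k+"),
]


def _band(value: int, cutoffs: Tuple[int, int]) -> int:
    """Index of the band `value` falls into: 0, 1, or 2."""
    return sum(value > c for c in cutoffs)


def recommend_tier(n_ds: int, models_in_prod: int) -> Tuple[str, str, str]:
    """Pick a matrix row; on disagreement, take the lower row (an assumption)."""
    team_band = _band(n_ds, (3, 10))
    model_band = _band(models_in_prod, (4, 20))
    return TIERS[min(team_band, model_band)]


label, stack, cost = recommend_tier(n_ds=8, models_in_prod=12)
print(label, "|", cost)
```

You may prefer to flag a mismatch for a human decision instead of silently taking the lower row; that is a one-line change.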

Don't skip levels. The two most common stack mistakes I see, in order:

  1. The two-person team that bought Tecton because someone watched a conference talk.
  2. The eight-person team still running everything on a single founder's laptop because "we don't need MLOps yet."

Both are bad. The first is over-investment with no payoff. The second is under-investment that bleeds productivity and credibility every week.

The 30-day stack audit

Concrete, week-by-week. Run this whether you've inherited a mess or built it yourself.

Days 1-3: Inventory what's actually deployed

Not what's in the architecture slide. What's actually running. Open every cron, every Airflow DAG, every SageMaker endpoint, every notebook on a schedule. Make a spreadsheet. Columns: tool, owner, monthly cost, percentage of team using it, last-touched date, kill-keep-upgrade.
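If you'd rather keep the inventory in the repo than in a shared drive, here is a sketch that writes the same columns to CSV. The class name, field names, and the example row are hypothetical; the columns themselves are the ones listed above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable

DECISIONS = {"kill", "keep", "upgrade"}


@dataclass
class ToolRecord:
    tool: str
    owner: str
    monthly_cost_usd: float
    pct_team_using: int   # 0-100
    last_touched: str     # ISO date, e.g. "2025-11-03"
    decision: str         # one of DECISIONS

    def __post_init__(self) -> None:
        # Fail early on typos so the decision column stays filterable.
        if self.decision not in DECISIONS:
            raise ValueError(f"decision must be one of {sorted(DECISIONS)}")


def to_csv(records: Iterable[ToolRecord]) -> str:
    """Serialize the inventory to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(ToolRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


# Illustrative row, not real data.
inventory_csv = to_csv([
    ToolRecord("TensorBoard", "unknown", 0.0, 10, "2024-02-11", "kill"),
])
print(inventory_csv)
```

Commit the file and the last-touched column starts maintaining itself through git history.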

You will find at least three things you didn't know existed.

Days 4-7: Find every model in prod

For each model: who owns it, what data it trains on, when it was last retrained, what its current performance metric is, and whether anyone would notice if it stopped running.

If nobody would notice, kill it. If nobody owns it, that's now your problem to assign.
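The two rules above fit in a few lines. A minimal sketch, with a hypothetical record shape that mirrors the questions in this step; the model names and values are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ProdModel:
    name: str
    owner: Optional[str]        # None means nobody owns it
    training_data: str
    last_retrained: date
    current_metric: str         # e.g. "AUC 0.81"
    anyone_would_notice: bool   # would someone notice if it stopped running?


def triage(model: ProdModel) -> str:
    """Apply the two rules in order: kill what nobody would miss, then assign owners."""
    if not model.anyone_would_notice:
        return "kill"
    if model.owner is None:
        return "assign owner"
    return "keep"


# Illustrative records, not real models.
models = [
    ProdModel("lead_score_v2", None, "crm.opportunities", date(2024, 6, 1), "AUC 0.79", True),
    ProdModel("legacy_upsell", "alice", "events.raw", date(2023, 1, 15), "unknown", False),
    ProdModel("churn_v4", "bob", "analytics.churn_features", date(2025, 9, 30), "AUC 0.84", True),
]
for m in models:
    print(f"{m.name}: {triage(m)}")
```

The "kill" check runs first on purpose: there is no point assigning an owner to a model nobody would miss.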

Days 8-14: Add monitoring to the worst-monitored model

Pick the model with the highest business impact and the worst monitoring. Add Evidently to it this week. Doesn't have to be pretty. A weekly drift report emailed to a channel is enough to start.

Days 15-21: Consolidate experiment tracking

Pick one tool. Migrate the active experiments. Tell the team to stop using the others. Archive the rest. This will be politically harder than it sounds because the person who set up the tool you're killing will take it personally. Do it anyway.

Days 22-30: Document the stack in one README

A single README in the team repo. Architecture diagram (boxes and arrows, not a Visio masterpiece). Each tool's purpose, owner, and login. The on-call procedure for each model in prod. The next DS hire should be able to read this in an hour and know what they're inheriting.

After 30 days, you can answer in one breath: every model in prod, who owns it, when it was last retrained, what its current drift looks like, and what one tool you'd cut tomorrow. If you can't answer that, the audit isn't done.

Common pitfalls

The greatest hits, in roughly the order I see them:

  • Buying tools before having models in prod. "We need a feature store." Do you? Do you have features? Do you have a model that uses them? Don't buy infrastructure for a future you haven't built yet.
  • Self-hosting MLflow without budgeting maintenance time. It's nearly free in dollars (about $50/mo for the VM). It's not free in hours. Someone has to keep the VM patched, the database backed up, and the auth working. If that someone is you and you also have to ship models, the math may favor the managed option.
  • Letting each DS pick their own tool. "We use whatever they used at their last job" is how you end up with three experiment trackers, two feature stores, and a 40-page onboarding doc.
  • Building a "platform" before you have three models that justify it. The platform-team-of-one trap. Don't generalize until you have specific things to generalize from.
  • Ignoring the CRM and operational data layer because it's "not ML." It's the layer that decides whether your labels are real. It's ML's foundation, not ML's neighbor.

Templates worth building

Four artifacts to keep in your team repo:

  1. Stack audit spreadsheet. Tool, monthly cost, owner, percentage of team using it, last-touched date, kill-keep-upgrade decision.
  2. "What's actually in prod" inventory. Model, owner, training data source, last retrained, monitoring status, business impact, on-call procedure.
  3. Build-vs-buy decision matrix. The table from this article, customized for your team's specific stack.
  4. Minimum-viable stack repo structure. A working example of MLflow plus BentoML plus Evidently wired together, so the next DS hire can clone it and ship a model in their first week.

The bottom line

The hardest part of an ML stack isn't the ML. It's the boring upstream layer (clean labels, clean events, one source of truth) and the boring downstream layer (monitoring you actually look at). The middle (which model, which feature store, which serving framework) gets the most attention and matters the least.

Tools matter. Stack discipline matters more. The DS who runs the 30-day audit, kills two redundant tools, and writes the README is more valuable than the one who benchmarks five gradient boosting libraries.

If you're hiring for this role, the Data Scientist JD template lays out the responsibilities and the bar. If you're already in the role and your stack looks like the opening paragraph of this guide, start the audit Monday.
