Bahasa Melayu

SQL and Data Modeling That Scales Beyond the First Dashboard

I've inherited three of these warehouses now and the diagnosis is always the same. There's a query someone pasted into a Slack DM at 4pm on a Friday. Marketing built a dashboard on top of it. Sales built another dashboard on top of that. Six months later, the CFO opens both in the same meeting, sees two different revenue numbers, and stops trusting the data team for the rest of the quarter.

The fix isn't smarter SQL. It's discipline: layers, tests, naming, and a deprecation cadence. This guide is that discipline, written for analysts who already know how to JOIN and want to stop being on-call for "why does this dashboard say something different."

The Slack-Thread Query

Here's how it starts. Someone in #revops asks a question. You write a 60-line query against the production replica, drop the answer in the thread, and move on. That query gets copied into a Mode notebook. The Mode notebook becomes a Looker tile. The Looker tile becomes a board-deck chart.

Ninety days later the underlying schema changes, the query silently breaks, and nobody notices because the chart still renders — just with bad numbers. Or worse, the query never broke but it never had a single named definition of "active customer" and now three teams are staring at three different counts.

The Slack-thread query is fine as a Slack-thread query. The problem is the path from "lookup" to "load-bearing dashboard" with no checkpoint in between. You need a checkpoint. The checkpoint is a model.

Two Definitions of Revenue Is a Trust Event

The day finance and product show up to the same meeting with two different revenue numbers, you've lost a year of trust and you'll spend six months getting it back. I've watched it happen at three different companies. It's not a tooling problem. Snowflake didn't fail you. dbt didn't fail you. What failed is that nobody made the call about what "revenue" means before five people wrote five queries.

This is preventable engineering. Layers and tests prevent the two-revenues problem. Naming and deprecation prevent the warehouse from rotting. The work is boring on purpose. If your data stack feels exciting day-to-day, something is probably on fire.

SQL Practices That Actually Scale

Four habits separate analysts who can hand off their warehouse from analysts whose code dies the moment they leave.

CTEs over nested subqueries. A nested subquery is a write-only artifact. You can write it, but six weeks later you can't read it, and neither can the next person. CTEs let you name each step, debug each step, and reason about one mental model per block.

-- Don't do this
SELECT customer_id, SUM(amount)
FROM (
  SELECT * FROM orders
  WHERE status = 'completed'
    AND created_at >= (SELECT MIN(created_at) FROM orders WHERE ...)
) o
JOIN (SELECT * FROM customers WHERE deleted_at IS NULL) c
  ON o.customer_id = c.id
GROUP BY 1;

-- Do this
WITH completed_orders AS (
  SELECT *
  FROM {{ ref('stg_shopify__orders') }}
  WHERE status = 'completed'
),

active_customers AS (
  SELECT *
  FROM {{ ref('stg_shopify__customers') }}
  WHERE deleted_at IS NULL
),

revenue_by_customer AS (
  SELECT
    c.customer_id,
    SUM(o.amount) AS lifetime_revenue
  FROM active_customers c
  LEFT JOIN completed_orders o
    ON o.customer_id = c.customer_id
  GROUP BY c.customer_id
)

SELECT * FROM revenue_by_customer;

The second version is longer. It's also debuggable. You can highlight any CTE, run it alone, and verify the row count. That's the whole game.

Named tests on every model. Not "I'll add tests later." On every model, in the same PR. The four that earn their keep:

# models/marts/_marts.yml
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'churned', 'paused']
      - name: account_owner_id
        tests:
          - relationships:
              to: ref('dim_users')
              field: user_id

Four tests, fifteen lines of YAML, and you'll catch ninety percent of the silent breakage that would otherwise show up in a board deck. unique and not_null on every primary key. accepted_values on every enum-shaped column. relationships on every foreign key. If you don't test the keys, you don't have a model. You have a query.

One single-source-of-truth model per business entity. One dim_customers. One fct_orders. One dim_users. Not dim_customers_v2, not dim_customers_marketing, not dim_customers_for_finance_q3. If marketing and finance need different customer cuts, they get different columns on the same dim_customers, or they build downstream marts on top of it. They don't get to clone the table.

The clone is how you get the two-revenues problem. Kill it on sight.

Window functions over self-joins. If you're self-joining a table to itself to compute "previous order date" or "rank within customer", you're working too hard. LAG(), LEAD(), ROW_NUMBER(), and RANK() are doing the same job in one pass with half the lines and a tenth of the warehouse cost.

Data Modeling Layers — Staging, Intermediate, Marts

This is the part nobody wants to do because it feels like extra work. It's not extra work. It's the work that makes everything after it possible.

Staging (stg_). One model per source table. Rename columns to your project's convention. Recast types. Nothing else. No joins, no business logic, no aggregations. The staging layer's job is to give every downstream model a clean, predictable, named view of one source table. If you skip staging because "it's just a rename", you'll regret it the day Stripe renames a column or your Fivetran sync changes case sensitivity.

-- models/staging/stripe/stg_stripe__charges.sql
WITH source AS (
  SELECT * FROM {{ source('stripe', 'charges') }}
),

renamed AS (
  SELECT
    id              AS charge_id,
    customer        AS customer_id,
    amount / 100.0  AS amount_usd,
    status          AS charge_status,
    created::TIMESTAMP AS charged_at
  FROM source
)

SELECT * FROM renamed;

That's the entire model. It's boring. That's the point.

Intermediate (int_). This is where joins, business logic, and reusable building blocks live. Intermediate models are never exposed to BI. They exist to keep your mart models readable and to let you reuse pieces. If three marts all need "monthly recurring revenue per customer", that's an int_customer_mrr_monthly model, built once, referenced three times.

-- models/intermediate/int_customer_mrr_monthly.sql
WITH subscriptions AS (
  SELECT * FROM {{ ref('stg_stripe__subscriptions') }}
),

months AS (
  SELECT * FROM {{ ref('int_calendar_months') }}
),

active_in_month AS (
  SELECT
    m.month_start,
    s.customer_id,
    s.monthly_amount_usd
  FROM months m
  LEFT JOIN subscriptions s
    ON m.month_start BETWEEN s.started_at AND COALESCE(s.canceled_at, '9999-12-31')
)

SELECT
  customer_id,
  month_start,
  SUM(monthly_amount_usd) AS mrr_usd
FROM active_in_month
GROUP BY 1, 2;

Marts (fct_ / dim_). This is what BI tools query. Denormalized, business-readable, and named the way an executive would name them. fct_ for fact tables (orders, charges, sessions). dim_ for dimension tables (customers, products, employees). Your mart layer is your warehouse's API. Treat changes to it like API changes: versioned, communicated, deprecated with notice.

A ten-person data team usually ends up with somewhere between 200 and 400 models across these three layers. Roughly 50-60% staging, 25-30% intermediate, 15-20% marts. If your mart count balloons past that ratio, you're doing business logic in marts that should live in intermediate.

The "Two Definitions of Revenue" Diagnostic

You can run this in twenty minutes. Do it before you write a new model in a warehouse you've inherited.

  1. Pull every dashboard tile, BI report, and saved query that has the word "revenue" in its title or column.
  2. For each one, copy the SQL into a single doc.
  3. Diff the WHERE clauses, the JOIN conditions, and the SUM targets.
  4. Make a table: dashboard name, owner, definition, source.
  5. Walk it to finance. Get one definition committed in writing.
  6. Build one fct_revenue model that matches the committed definition.
  7. Migrate every dashboard to that model. Deprecate the old queries.

Most teams find three to five different revenue definitions in the first audit. Sometimes they're all "right" for some narrow purpose, but only one of them is what the CFO will defend on an earnings call. That's the one that lives in the model. The others get rewritten as views on top of it (vw_revenue_marketing_attributed, vw_revenue_booking_basis) so it's obvious they're cuts of the same number, not separate numbers.

Run the same audit on "active customer", "churn", "ARR", and "MRR". You'll find the same problem. Fix it the same way.

dbt Naming Conventions

Naming isn't bikeshedding. It's how the next analyst orients themselves in your warehouse without asking you a question.

  • stg_{source}__{entity}. Double underscore separates source from entity. stg_stripe__charges, stg_hubspot__contacts, stg_shopify__orders. The double underscore is the visual cue that the left side is a system name.
  • int_{entity}_{verb}. Describes what the model does. int_orders_pivoted, int_customers_enriched, int_sessions_attributed.
  • fct_{event} and dim_{entity}. Examples: fct_orders, fct_charges, fct_page_views, dim_customers, dim_products, dim_dates.
  • Source-aliasing in tests, not column-aliasing in mart SQL. Your dim_customers.customer_id should be customer_id, not dim_customers__customer_id.

The _v2 suffix is a smell. If you see dim_customers_v2.sql next to dim_customers.sql, somebody got scared to break the original and shipped a fork. Versioning belongs in git, not in the filename. The right move is: rewrite dim_customers, ship the breaking change with a migration note, deprecate the old behavior on a calendar. Keeping both forever guarantees the two-revenues problem will return wearing a different hat.

Version Control + Change Log Discipline

No direct merges to main. Ever. Even for the one-line typo fix. The moment you carve out an exception, it becomes the rule.

Every PR has a one-line CHANGELOG entry in CHANGELOG.md:

## 2026-04-29
- `fct_orders`: added `is_first_order` column. Affects: marketing-attribution, lifecycle-emails dashboards. (@camellia)
- `breaking: dim_customers`: renamed `acct_owner` → `account_owner_id`. Affects: all CRM dashboards. Migration: see #1284. (@camellia)
- `stg_stripe__charges`: handle null `customer_id` from guest checkouts. (@camellia)

Three rules on the CHANGELOG:

  1. Every PR adds an entry. No PR ships without one. CI can enforce this.
  2. Schema changes get tagged breaking: and pinged in #data-changes before the PR merges.
  3. The "affects" field names dashboards or downstream models, not vague phrases like "various reports". If you can't name what it affects, you don't know your downstream graph well enough to merge.

Schema changes also get a calendar deprecation. The old column stays for 30 days with a comment, the new column gets added in parallel, downstream consumers migrate, then the old column drops. No exceptions for "easy" renames. Easy renames are how you discover that someone built a Zapier flow on top of your warehouse and nobody told you.

Dashboard vs One-Off — When to Build What

Decision rule: will this number be asked for again within 30 days?

  • Yes → it's a dashboard. The number lives behind a model. The dashboard has an owner.
  • No → send the CSV. Link the SQL in the request thread or PR. Move on.

The mistake is building a dashboard for every ad-hoc ask. Six months later you have 400 dashboards, 350 of them haven't been opened since the day they shipped, and the analyst who built them left the company.

Dashboards have an owner field. Not "the data team". A name. If a dashboard exists without a named owner, archive it. The owner's job is to answer "is this still right?" when stakeholders ask. If nobody owns it, nobody can answer, and you're back to the trust problem.

One-off SQL goes in a analyses/ folder in your dbt project, committed and reviewed but not modeled. That gives you a paper trail without inflating the warehouse.

Deprecation Cadence

Warehouses rot the way attics rot — slowly, then all at once. The cadence that prevents it:

  • Quarterly review of dashboards by last-viewed date. Most BI tools expose this. Pull a list every quarter.
  • 60 days unviewed → deprecation banner. A visible "this dashboard is scheduled for archival on YYYY-MM-DD" message. Pings the owner. Forces a decision.
  • 90 days unviewed → archived. Move it to an archive/ folder. Mark the underlying model enabled: false in dbt config so it stops building.
# models/marts/legacy/_legacy.yml
models:
  - name: fct_orders_2024_legacy
    config:
      enabled: false
    description: |
      Archived 2026-04-29 — superseded by fct_orders.
      Last queried 2026-01-12. Owner left company 2025-Q4.

After a year of this cadence, you'll cut your model count by 30-40% with zero downstream complaints. The models you cut were already dead. You're just making it official so they stop costing you compute and confusing new analysts.

Common Pitfalls

The four I see most often, ranked by how much damage they do:

  1. Modeling before agreeing on definitions. Build the audit. Get the definition. Then build the model. The other order produces the two-revenues problem on a fresh warehouse.
  2. Skipping staging because "it's just a rename". Skipping staging means every downstream model breaks the day a source schema changes. Staging is the shock absorber. Pay the boring tax.
  3. Letting _v2 files multiply. Every _v2 is a confession that someone was scared to break the original. Fix the original. Communicate the change. Ship one model.
  4. No owner on critical models. A fct_revenue with no owner is a liability with a SQL file attached. Owner field is mandatory in your model YAML, same as tests.

What "Done" Looks Like

You've built this discipline when:

  • Every business metric has exactly one model that defines it.
  • Every model has at least unique and not_null tests on its primary key.
  • Every dashboard has a named owner.
  • A new analyst can trace any number on any dashboard back to its source (through marts, intermediate, staging, source) in under ten minutes, without asking anyone.
  • The CHANGELOG tells you what changed last week without opening Slack.
  • Your compute spend is flat or dropping while your model count grows, because you deprecated the dead stuff.

That's the bar. It's boring. It's also how you stop being the team that gets paged at 11pm because the board deck doesn't tie out.

Learn More