
Growth Experiment Design: Hypothesis to MDE to Readout

The first time I shipped a "winner" that lost us 3% of trials was a pricing-page CTA test. Three days, 240 visitors, variant B up 6.2%. The PM hit Slack with the trophy emoji. Two weeks later our weekly trial volume looked off, I re-ran the math, and the original "winner" was inside the noise band the whole time. We'd shipped nothing, except a quarter of roadmap built on a vibe.

That's the job, honestly. Not the testing. The not-being-fooled part.

This guide is the playbook I wish someone had handed me on day one: how to write a hypothesis that can actually be killed, how to do the sample-size math on a napkin, how to pick between ICE and RICE without lying to yourself, and how to write a readout your CFO will actually open. If you take one thing from it, take this: the test isn't the deliverable. The readout is. Most quarters, that's 4 readouts, not 40.

Why most B2B experiments fail

I've audited maybe 60 growth experiments across SaaS and PLG companies in the last few years. The failure modes cluster into five things, and four of them are decided before the test goes live.

1. Under-powered from day one. Team runs a 5-day test on 800 users with an 8% baseline conversion, sees a 0.4 percentage-point lift, calls it a winner. The actual MDE on that sample size is something like 4 to 5 percentage points absolute, which is a relative lift of roughly 50% or more on an 8% baseline. Anything smaller than that is statistically indistinguishable from coin flips. They didn't run a bad experiment. They ran a test that was incapable of detecting the effect they were hoping for.

2. Hypothesis written as a to-do. "We will test a new headline on the pricing page." That's not a hypothesis. A hypothesis predicts what changes, by how much, for whom, and why. If your hypothesis can't be falsified by the data, the test will produce a story no matter what happens.

3. No rollback plan. Variant ships, conversion drops 4%, nobody owns the rollback decision, the variant runs for another 6 weeks because "we want more data." You don't need more data. You need a pre-registered stopping rule.

4. No primary metric pre-registered. Three days in, conversion is flat but time-on-page is up 22%, so suddenly time-on-page is the metric. This has a name: HARKing (Hypothesizing After the Results are Known). Every team does it. Every team that does it produces unreliable readouts.

5. The PM eyeballs the chart on day 3. Peeking. Sequential testing exists for this exact reason, and most teams aren't using it. A standard fixed-horizon test loses its statistical guarantees the moment you make decisions based on interim looks.

The first four are solved by a hypothesis template. The fifth is solved by a sample-size calculator and the discipline to leave the dashboard alone.
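To make failure mode 1 concrete, here is a minimal sketch that inverts the napkin formula from the MDE section further down (n per arm ≈ 16 × p × (1 - p) / MDE²) to ask what a given sample can detect. The function name is mine, and the 50/50 split of the 800 users is an assumption.

```python
import math

def napkin_mde(baseline: float, n_per_arm: int) -> float:
    """Smallest absolute lift detectable at roughly 80% power, 95% confidence.

    Inverts the planning approximation n ≈ 16 * p * (1 - p) / MDE^2.
    """
    return math.sqrt(16 * baseline * (1 - baseline) / n_per_arm)

# Failure mode 1: 800 users total (assumed 50/50 split), 8% baseline.
mde = napkin_mde(baseline=0.08, n_per_arm=400)
observed_lift = 0.004  # the 0.4 percentage-point "win"

print(f"Detectable lift: {mde * 100:.1f} pp, observed: {observed_lift * 100:.1f} pp")
print("Inside the noise band" if observed_lift < mde else "Detectable")
```

The observed lift sits far below what the sample could ever have detected, which is the whole point of doing this check before launch rather than after.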

The hypothesis template

Copy this. Paste it into your team's experiment tracker. Make it the only template anyone is allowed to use.

EXPERIMENT: [short name, e.g. "Pricing-page social proof block"]

PROBLEM (what we observed in the data):
  We see X behavior in segment Y. Specifically:
  - Data point 1: [from analytics, support tickets, sales calls, qual]
  - Data point 2: [confirming or triangulating]

PREDICTED CHANGE (what we'll do, for whom):
  For [segment], we will [change], because [mechanism we believe is at work].

SUCCESS METRIC:
  Primary:  [one metric, with current baseline number]
  Guardrail 1: [must not move worse than X]
  Guardrail 2: [must not move worse than Y]

MDE (the smallest effect we'd act on):
  We need to detect a [N%] relative lift on the primary metric.
  Below that, the change isn't worth the engineering cost / brand risk / focus.

SAMPLE SIZE & DURATION:
  Per arm: [N] users
  Estimated duration at current traffic: [N weeks]

ROLLBACK CRITERIA:
  We kill the variant immediately if:
  - Primary moves worse by more than [X]
  - Either guardrail breaches its threshold for >48h
  - Engineering finds a P0/P1 bug

DECISION DATE: [a real date — not "when we have enough data"]
OWNER: [one person]

Two notes. First, the MDE line is not a wish. It's the threshold below which you wouldn't ship the change anyway, even if it were "real." If a 1.5% lift on activation isn't worth the maintenance cost of carrying the variant in code forever, then a 1.5% lift isn't your MDE. Your MDE is whatever number actually clears the cost. Be honest there.

Second, the decision date kills more zombies than anything else in this template. Without it, every test runs forever.
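If your team wants the rollback criteria to be something a script can check rather than something a person has to remember, a minimal sketch could look like the one below. All names are hypothetical, and it assumes the primary metric is one where lower is worse and the delta is expressed as a relative change.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RollbackRule:
    max_primary_drop: float          # relative drop that triggers a kill, e.g. 0.03
    guardrail_grace_hours: int = 48  # from the template: breach for >48h

def should_rollback(rule, primary_delta, guardrail_breach_hours, has_p0_p1_bug):
    """Return (kill, reason). primary_delta is relative; negative means worse."""
    if has_p0_p1_bug:
        return True, "P0/P1 bug"
    if primary_delta < -rule.max_primary_drop:
        return True, "primary metric past threshold"
    if any(hours > rule.guardrail_grace_hours for hours in guardrail_breach_hours):
        return True, "guardrail breached for too long"
    return False, "keep running"

rule = RollbackRule(max_primary_drop=0.03)
# Primary down 1%, second guardrail in breach for 50 hours.
print(should_rollback(rule, primary_delta=-0.01,
                      guardrail_breach_hours=[12, 50], has_p0_p1_bug=False))
```

The value is less in the code than in being forced to write down the thresholds before launch.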

MDE math you can do on a napkin

Here's the formula I use for planning, which a real statistician would mildly object to but which gets you within 10% of the truth and is fast enough that you'll actually use it:

n per arm  ≈  16 × p × (1 - p) / MDE²

Where:

  • p is your baseline conversion rate (as a decimal, e.g. 0.08 for 8%)
  • MDE is the absolute lift you want to detect (as a decimal, e.g. 0.008 for an 8.0% → 8.8% move, which is a 10% relative lift)
  • 16 bakes in 80% power and 95% confidence (two-sided)

That's it. No software needed. Let's run a real one.
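One aside before the example, for anyone wondering where the 16 comes from. Under the usual normal approximation for comparing two proportions, the multiplier is 2 × (z for the two-sided significance level + z for power)². A minimal check with Python's standard library, assuming both arms share roughly the baseline variance p × (1 - p):

```python
from statistics import NormalDist

alpha = 0.05  # two-sided significance level (95% confidence)
power = 0.80

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
z_beta = NormalDist().inv_cdf(power)           # about 0.84
constant = 2 * (z_alpha + z_beta) ** 2

print(f"{constant:.2f}")  # about 15.7, rounded up to 16 for napkin use
```

Rounding up to 16 makes the napkin version slightly conservative, which is the right direction to be wrong in.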

Worked example: an 8% trial-to-paid conversion

Your B2B SaaS has 600 weekly signups. Trial-to-paid is 8% (so p = 0.08). You want to detect a 10% relative lift, meaning 8.0% → 8.8% absolute (so MDE = 0.008).

n per arm  =  16 × 0.08 × 0.92 / (0.008)²
           =  16 × 0.0736 / 0.000064
           =  1.1776 / 0.000064
           ≈  18,400 users per arm

Two arms = 36,800 users. At 600 signups/week split 50/50 across the test, that's roughly 61 weeks of traffic for one experiment, well over a year. Not 5 days.

Now, if you want to detect a 25% relative lift (8.0% → 10.0%), the math gets friendlier:

n per arm  =  16 × 0.08 × 0.92 / (0.02)²
           =  1.1776 / 0.0004
           ≈  2,944 per arm

About 6,000 users total. At 600/week, ~10 weeks. The catch: 25% relative lifts on trial-to-paid are basically unicorns in mature B2B funnels. You'll get one or two a year if you're good. Most real wins are 3–8% relative, which means most of your tests need many months of traffic (more than a year at this volume), not days.

This is the part nobody wants to hear: your funnel doesn't move 25%, so your experiments need to be powered for the lifts that actually exist. Hand-wave this and every test becomes a Rorschach.
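Both worked examples, plus the traffic math, fit in a few lines if you'd rather not redo the arithmetic for every candidate MDE. A minimal sketch, assuming a 50/50 split and the napkin constant of 16; the function names are mine, not from any library.

```python
def napkin_sample_size(baseline: float, relative_lift: float) -> int:
    """Users per arm via n ≈ 16 * p * (1 - p) / MDE^2, with MDE as an absolute lift."""
    mde = baseline * relative_lift
    # Rounded to the nearest user; this is planning precision, not a power analysis.
    return round(16 * baseline * (1 - baseline) / mde ** 2)

def weeks_needed(n_per_arm: int, weekly_traffic: int, arms: int = 2) -> float:
    """Weeks of traffic needed to fill every arm."""
    return n_per_arm * arms / weekly_traffic

for lift in (0.10, 0.25):
    n = napkin_sample_size(baseline=0.08, relative_lift=lift)
    weeks = weeks_needed(n, weekly_traffic=600)
    print(f"{lift:.0%} relative lift: {n:,} per arm, {weeks:.1f} weeks at 600/week")
```

Try it with relative lifts between 0.03 and 0.08 to see what the lifts that actually exist cost in traffic.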

When "we'll just run it longer" is wrong

If a test was under-powered on day one, extending it after you've seen the interim numbers doesn't fix it at fixed-horizon settings. It inflates your false-positive rate, because the decision to keep going is itself a peek. If you genuinely need flexibility on duration, switch to:

  • Sequential testing (mSPRT, always-valid p-values): lets you stop early or extend without breaking the math. Statsig, GrowthBook, and Eppo all support it natively.
  • CUPED (variance reduction using pre-experiment data): can cut required sample size by 30–50% on metrics with strong pre-period signal. Worth turning on for any major test.

Don't try to roll these by hand. Use the platform.
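What is worth doing by hand, once, is convincing yourself that peeking really does break a fixed-horizon test. The sketch below is a toy A/A simulation: both arms convert at the same 8%, so every "significant" result is a false positive. It compares checking only at the end against declaring a winner at any of 10 interim looks. The simulation sizes and the seed are arbitrary choices of mine, and it uses a plain pooled z-test rather than anything a platform would run.

```python
import math
import random

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_aa(n_sims=400, n_per_arm=2000, looks=10, p=0.08, alpha=0.05, seed=7):
    """Return (false-positive rate with peeking, false-positive rate at final look)."""
    rng = random.Random(seed)
    chunk = n_per_arm // looks
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        significant_at_any_look = False
        for look in range(1, looks + 1):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < p for _ in range(chunk))
            conv_b += sum(rng.random() < p for _ in range(chunk))
            p_value = z_test_p_value(conv_a, look * chunk, conv_b, look * chunk)
            if p_value < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        fixed_hits += p_value < alpha  # p_value here is from the final look only
    return peeking_hits / n_sims, fixed_hits / n_sims

peek_rate, fixed_rate = simulate_aa()
print(f"False positives, final look only: {fixed_rate:.1%}")
print(f"False positives, stop at any look: {peek_rate:.1%}")
```

The final-look rate should land near the nominal 5%, while the stop-at-any-look rate comes out several times higher. That gap is what sequential methods are designed to close.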

Common diagnoses to know by name

If you can name the failure mode, you can argue against it in a readout review. The five I see most:

  • HARKing: picking the metric after seeing the result. Solved by pre-registering primary + guardrails before launch.
  • Peeking: making decisions on interim looks at fixed-horizon tests. Solved by sequential testing or by genuinely not looking until the decision date.
  • Novelty effect: variant wins for two weeks because it's new, then regresses. Solved by extending tests on UI changes and watching week 3+ behavior.
  • Simpson's paradox: variant wins overall but loses in every segment, because the mix shifted. Solved by always pre-segmenting your readout (new vs returning, by plan, by source).
  • Survivorship bias in cohort metrics: measuring "retention at week 4" only on users who made it to week 4 inflates the number. Solved by anchoring cohorts at the entry event.
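Simpson's paradox is the one people don't believe until they see the numbers. A minimal sketch with made-up counts (not from any real test), where the variant loses in both segments but wins overall because it happened to receive more of the high-converting segment:

```python
# Hypothetical (conversions, users) per segment and arm.
data = {
    "new":       {"control": (300, 3000), "variant": (90, 1000)},
    "returning": {"control": (300, 1000), "variant": (840, 3000)},
}

def segment_rate(segment: str, arm: str) -> float:
    conversions, users = data[segment][arm]
    return conversions / users

def overall_rate(arm: str) -> float:
    conversions = sum(seg[arm][0] for seg in data.values())
    users = sum(seg[arm][1] for seg in data.values())
    return conversions / users

for segment in data:
    print(f"{segment:>9}: control {segment_rate(segment, 'control'):.1%}, "
          f"variant {segment_rate(segment, 'variant'):.1%}")
print(f"  overall: control {overall_rate('control'):.1%}, "
      f"variant {overall_rate('variant'):.1%}")
```

If the segment mix differs this much between arms, that is also a signal to check your randomization before you trust anything else in the readout.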

Prioritization: ICE vs RICE vs PIE

Three frameworks, slightly different ingredients, all of them lying to you in different ways.

| Framework | Ingredients | Best for | Where it breaks |
|-----------|-------------|----------|-----------------|
| ICE | Impact × Confidence × Ease (1–10 each) | 2–5 person teams; back-of-napkin | Subjective. Authors score their own ideas. "Ease" is usually wrong. |
| RICE | (Reach × Impact × Confidence) / Effort | 10+ person teams; portfolio across segments | "Reach" hides traffic differences across funnel stages; effort still self-scored. |
| PIE | Potential × Importance × Ease (1–10 each) | CRO-heavy, page-level optimization | Assumes you can estimate "potential" from page traffic, which is usually false in B2B. |
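For reference, the three scores as defined in the table fit in a few lines. This is a sketch of the arithmetic only; the example inputs are invented, and the units for Reach and Effort (users per quarter, person-weeks) are my assumption.

```python
def ice(impact: float, confidence: float, ease: float) -> float:
    """Impact × Confidence × Ease, each scored 1-10."""
    return impact * confidence * ease

def rice(reach: float, impact: float, confidence: float, effort: float) -> float:
    """(Reach × Impact × Confidence) / Effort."""
    return reach * impact * confidence / effort

def pie(potential: float, importance: float, ease: float) -> float:
    """Potential × Importance × Ease, each scored 1-10."""
    return potential * importance * ease

# The same idea, scored once by its owner and once by a skeptical reviewer.
owner_score = ice(impact=8, confidence=8, ease=6)
reviewer_score = ice(impact=8, confidence=5, ease=3)
print(owner_score, reviewer_score)

# RICE with assumed units: 2,000 users/quarter reached, 4 person-weeks of effort.
print(rice(reach=2000, impact=2, confidence=0.8, effort=4))
```

Nudging two self-scored inputs moves the ICE score by more than 3x, which is why the number says more about who filled in the form than about the idea.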

My honest take: ICE is fine for a 2-person team and lies for a 20-person team. When your team is small enough that everyone has read every doc, ICE is just a way of writing down a conversation you'd have anyway. Once the team is big enough that the ICE score is the only artifact a stakeholder reads, every PM games it.

The trap with all three: you're scoring your own experiments. Owners over-weight Confidence on their own ideas. Engineers under-weight Ease on someone else's. The score becomes a proxy for office politics.

What I run instead at scale: a Confidence × Reach 2x2 with no math. Top-right (high confidence, high reach) ships now. Top-left (high confidence, narrow reach) ships if it's cheap. Bottom-right (low confidence, broad reach) becomes a paid research investment: we fund the test on the basis of learning value, not expected lift. Bottom-left dies. Reviewed weekly, in a 30-minute meeting, with the head of growth holding the marker.

It's not a number. It's a forcing function for honest conversation.
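If you keep the backlog in a tracker and want the quadrant to show up as a column, the 2x2 reduces to a lookup. A minimal sketch; the function name and labels are mine, and deciding which side of each line an idea falls on is still the meeting's job, not the code's.

```python
def triage(high_confidence: bool, broad_reach: bool) -> str:
    """Map a Confidence × Reach quadrant to the decision described above."""
    if high_confidence and broad_reach:
        return "ship now"
    if high_confidence:
        return "ship if cheap"
    if broad_reach:
        return "fund as research (learning value, not expected lift)"
    return "kill"

for confidence in (True, False):
    for reach in (True, False):
        print(f"high_confidence={confidence!s:<5} broad_reach={reach!s:<5} -> "
              f"{triage(confidence, reach)}")
```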

WIP limit: 3–5 live tests max

For most B2B teams under 500 employees, the right number of concurrent experiments is 3 to 5. Above that, you eat your own traffic, your interaction effects get untraceable, and your team can't actually pay attention to the readouts. The constraint isn't engineering velocity. It's traffic and attention.

The readout doc (this is the actual deliverable)

Every shipped, killed, or inconclusive test gets a one-page readout. Not a dashboard. A doc. Saved in the same folder forever.

READOUT: [experiment name]
DATES: [start] → [stop]
OWNER: [name]
STATUS: ✅ Shipped / ❌ Killed / 🤷 Inconclusive

WHAT SHIPPED
  Variant B replaced [X] with [Y] on [page/flow], for [segment], from [date] to [date].

WHAT WE MEASURED
  Primary:    [metric] — control [N], variant [N], delta [+X% / -Y%], p = [N], CI [N, N]
  Guardrail 1: [metric] — flat / breached
  Guardrail 2: [metric] — flat / breached
  Sample:     [N per arm] — powered for [MDE]% relative lift

WHAT WE LEARNED
  - Result interpretation in 2–3 sentences. No "we crushed it." Yes "trial-to-paid moved
    +4.2% (CI 1.1–7.3%), clearing our pre-registered MDE of 4%, so we ship."
  - Segment splits: [where the effect was strongest / weakest]
  - Anything weird: [novelty signal? guardrail noise? data quality?]

WHAT WE'RE DOING NEXT
  - Ship / hold-out plan for variant
  - Follow-up tests (max 2)
  - Anything that needs eng, product, or design attention

WE WERE WRONG ABOUT ___
  One sentence. The thing we believed going in that the data disproved (or refused to confirm).

The "we were wrong about" line is the secret weapon. It does three things:

  1. Builds team trust. Leaders see you're not packaging losses as wins.
  2. Compounds learnings. Over a year you have 30+ "we were wrong about" lines, and patterns emerge ("we keep over-estimating the impact of pricing-page changes").
  3. Calibrates future Confidence scores. Your priors get sharper.

If your readouts don't have a "we were wrong" line for at least 60% of completed tests, you're either testing only safe things or you're rewriting history. Both are bad.
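For the Primary line of the readout (delta, p, CI), most platforms will hand you the numbers. If you need to sanity-check them, a minimal sketch with a two-proportion z-test looks like this. It assumes a fixed-horizon test read once, at the decision date; the counts are invented, and the function name is mine.

```python
import math
from statistics import NormalDist

def readout_stats(conv_c, n_c, conv_v, n_v, confidence=0.95):
    """Delta, two-sided p-value, and CI on the absolute difference (variant - control)."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c

    # p-value: pooled standard error under the null of no difference.
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se_pooled))

    # Confidence interval: unpooled standard error around the observed difference.
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {"control": p_c, "variant": p_v, "abs_delta": diff,
            "rel_delta": diff / p_c, "p_value": p_value, "ci_abs": ci}

# Hypothetical counts: 18,400 users per arm.
stats = readout_stats(conv_c=1472, n_c=18400, conv_v=1619, n_v=18400)
print(f"control {stats['control']:.2%}, variant {stats['variant']:.2%}, "
      f"delta {stats['rel_delta']:+.1%} relative, p = {stats['p_value']:.4f}, "
      f"CI [{stats['ci_abs'][0]:+.2%}, {stats['ci_abs'][1]:+.2%}] absolute")
```

Note that the CI here is on the absolute difference. If your readout quotes a relative lift, say so explicitly, because mixing the two is an easy way to mislead a reader who skims.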

Where the hypothesis backlog comes from

A test pipeline starves if the team doesn't have a structured way to source ideas. Five sources I trust, in roughly descending order of signal:

  1. Funnel diffs: segment X converts at half the rate of segment Y at the same step. Go figure out why. This is where the biggest, most defensible wins live.
  2. Qual interviews: 5 churned customers, recorded, transcribed. You will hear the same friction in 3 of them. That friction is your next hypothesis.
  3. Sales-call recordings: Gong/Chorus is a goldmine. Search "I wish it could" or "the thing that confused me." Each one is a hypothesis with confidence pre-baked.
  4. Support tickets: same idea, lower-funnel. Cluster by topic. The biggest cluster is often a 2-week eng fix that lifts activation more than your last 6 tests combined.
  5. Competitor teardowns: useful but dangerous. You'll over-weight novelty. Tag these as low Confidence by default.

Score each idea against the hypothesis template before it enters the prioritization queue. If you can't fill in the Problem section with two real data points, the idea isn't ready. It's a guess. Send it back for research.
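The readiness gate is simple enough to automate if your tracker exports to something scriptable. A minimal sketch, with field names I made up to mirror the template sections; adapt them to whatever your tracker actually calls things.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Hypothesis:
    name: str
    data_points: list[str] = field(default_factory=list)  # PROBLEM evidence
    segment: str = ""
    change: str = ""
    mechanism: str = ""
    primary_metric: str = ""
    baseline: Optional[float] = None
    guardrails: list[str] = field(default_factory=list)
    relative_mde: Optional[float] = None
    decision_date: Optional[date] = None
    owner: str = ""

    def missing(self) -> list[str]:
        """List the template sections that are not filled in yet."""
        gaps = []
        if len(self.data_points) < 2:
            gaps.append("problem needs two data points")
        if not (self.segment and self.change and self.mechanism):
            gaps.append("predicted change incomplete")
        if not self.primary_metric or self.baseline is None:
            gaps.append("primary metric or baseline missing")
        if len(self.guardrails) < 2:
            gaps.append("two guardrails required")
        if self.relative_mde is None:
            gaps.append("MDE missing")
        if self.decision_date is None:
            gaps.append("decision date missing")
        if not self.owner:
            gaps.append("owner missing")
        return gaps

idea = Hypothesis(name="Pricing-page social proof block",
                  data_points=["(one analytics data point)"])
print(idea.missing() or "ready for prioritization")
```

Anything that comes back with a non-empty list goes back for research, not into the queue.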

Killing zombie experiments

Every growth team I've seen has them: tests still serving traffic to a variant nobody owns, behind a flag nobody remembers, on a page nobody audits. Three rules:

  • The 90-day rule. If a test has been live more than 90 days without a readout, it's killed by default at the next quarterly review. No exceptions for "we're waiting on more data." If a test needs 4 months to reach significance, it was under-powered on launch and the right answer is to stop and re-design.
  • Quarterly graveyard review. Once a quarter, audit every active flag in your experimentation platform. Match each one to an owner and a readout doc. Anything orphaned ships back to control and the flag gets deleted from the codebase.
  • The "still serving traffic" audit. Pull the list of all experiment-eligible URLs and cross-check against active tests in the platform. Every gap is either a config bug or a zombie. Fix both.

The team that runs this audit honestly will find that 30–40% of their "active" tests are dead weight. Killing them frees traffic and attention for tests that can actually learn.
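The first two rules are easy to script against an export from your experimentation platform. A minimal sketch, assuming each active test comes out as a record with a flag name, launch date, owner, and readout link. That record shape is hypothetical, not any platform's actual API.

```python
from datetime import date

MAX_DAYS_LIVE = 90  # the 90-day rule

def find_zombies(tests, today):
    """Return (flag, days_live, reason) for every test that fails the audit."""
    zombies = []
    for test in tests:
        days_live = (today - test["launched"]).days
        if not test.get("owner"):
            zombies.append((test["flag"], days_live, "no owner"))
        elif days_live > MAX_DAYS_LIVE and not test.get("readout"):
            zombies.append((test["flag"], days_live, "live > 90 days, no readout"))
    return zombies

# Hypothetical export of active flags.
active_tests = [
    {"flag": "pricing_social_proof", "launched": date(2024, 1, 10),
     "owner": "dana", "readout": None},
    {"flag": "onboarding_checklist", "launched": date(2024, 5, 1),
     "owner": "lee", "readout": None},
    {"flag": "legacy_banner", "launched": date(2024, 4, 15),
     "owner": None, "readout": None},
]

for flag, days_live, reason in find_zombies(active_tests, today=date(2024, 6, 1)):
    print(f"{flag}: {days_live} days live, {reason}")
```

Run it before the quarterly review so the meeting is about decisions, not about discovering what's live.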

The Growth IC's actual job

I'll close where I opened. The IC's job isn't to ship more tests. It's to ship more learnings. Most quarters, that's 4 well-designed, properly-powered, honestly-readout tests, not 40 shrugs.

A good experimentation practice looks slow from the outside. The team is running 3 tests, not 30. More than half the readouts say "we were wrong." The PM defending peeking gets pushed back on. The CFO actually opens the readout doc and asks a question about the guardrail metric.

That's the job working. The trophies in Slack come later, and they're real because the math was real.
