
Experiment Design That Survives Stakeholder Review

A team I worked with ran a banner-color test last quarter. Three days. Around 200 users per arm. The PM looked at the dashboard, saw p = 0.07 on click-through, and wrote in Slack: "directionally positive, let's ship." Six weeks later the topline metric was flat, the downstream ML personalization model had silently retrained on traffic that was randomized at the session level for a user-level objective, and when the VP asked what the original hypothesis was, nobody could find it.

That experiment had four problems stacked on top of each other: under-powered sample, no minimum detectable effect calculation, wrong randomization unit, and a peek on day 2 that nudged the call. Each one alone would have killed the result. Together they produced a confident decision built on nothing.

This guide is the antidote: the design playbook that makes that story much harder to repeat. The goal is a readout doc that a PM, an engineering lead, and a skeptical VP can all read and reach the same conclusion you did.

The Hypothesis Template

Most experiments fail before any data is collected because the hypothesis is unfalsifiable. "Improve UX." "Make checkout better." "Increase engagement." None of these can be wrong, which means none of them can be right either.

A hypothesis you can defend has four parts:

  • Problem: the specific number you don't like, today, with a baseline.
  • Predicted change: the thing you're going to do, in one sentence.
  • Success metric: the single primary number you'll judge it on.
  • MDE: the smallest effect size that would change your business decision.

Filled in, it looks like this:

Checkout completion rate is 38% (90-day baseline, n ≈ 1.2M sessions). Adding a 4-step progress bar on the checkout flow will reduce drop-off after the address step. Primary metric: completion rate, measured per user, 14-day window. MDE: 1.5 percentage points absolute lift (4% relative). Anything smaller doesn't justify the eng cost.

Notice what the template forces. You commit to a baseline number, so you can't argue post-hoc that the metric was different. You commit to one primary metric, so you can't switch to a secondary when the primary disappoints. You commit to an MDE, so you can't claim a 0.3pp shift "matters." And the MDE is grounded in a business decision (the smallest lift that would actually change what you do next), not a statistical convenience.
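One way to make the template hard to fudge is to store it as a frozen record at sign-off. The sketch below is illustrative only: the `Hypothesis` class and its field names are assumptions, not part of any existing tool, and it simply encodes the four parts plus the baseline from the checkout example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Four-part hypothesis, frozen so it cannot be edited after sign-off."""
    problem: str            # the number you don't like, with its baseline
    predicted_change: str   # the intervention, in one sentence
    primary_metric: str     # the single metric the decision rests on
    baseline: float         # baseline rate as a fraction, e.g. 0.38
    mde_abs: float          # smallest absolute lift that changes the decision

    def __post_init__(self) -> None:
        if not 0.0 < self.baseline < 1.0:
            raise ValueError("baseline must be a rate strictly between 0 and 1")
        if self.mde_abs <= 0.0:
            raise ValueError("MDE must be a positive absolute lift")

    @property
    def mde_relative(self) -> float:
        """MDE expressed relative to the baseline."""
        return self.mde_abs / self.baseline


checkout = Hypothesis(
    problem="Checkout completion rate is 38% (90-day baseline)",
    predicted_change="Add a 4-step progress bar to the checkout flow",
    primary_metric="completion rate per user, 14-day window",
    baseline=0.38,
    mde_abs=0.015,
)
print(f"{checkout.mde_relative:.1%}")  # 3.9%, the roughly 4% relative lift
```

Because the record is frozen, changing the primary metric or the MDE after the fact means creating a new object, which is exactly the kind of visible edit a reviewer should be able to spot.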

Reject vague hypotheses on the way in. If a stakeholder says "we want to test a new layout to see what happens," your job is to push back: what number changes, by how much, and why does that number matter? "Let's see what happens" is a research question, not an experiment.

MDE Math and Sample Size

This is the section that, more than any other, will save you from shipping junk. The math is not optional.

For a two-sample test of proportions with α = 0.05 (two-sided) and power = 0.80, the per-arm sample size is approximately:

n ≈ 16 × σ² / δ²

Where σ² is the variance of the metric and δ is the absolute effect size you want to detect (your MDE). For a binary metric like conversion, σ² ≈ p(1 − p) where p is the baseline rate.

Let's do the checkout example end-to-end.

  • Baseline completion rate: p = 0.38
  • Variance: σ² = 0.38 × 0.62 ≈ 0.2356
  • MDE: δ = 0.015 (1.5 percentage points absolute)
  • δ² = 0.000225
n ≈ 16 × 0.2356 / 0.000225
n ≈ 16,754 per arm

So roughly n = 17,000 users per arm, n = 34,000 total, to reliably detect a 1.5pp lift on a 38% baseline at 80% power. If the daily eligible user volume is 5,000, that's a 7-day test minimum. If you want a 1pp MDE instead, the denominator drops by ~2.25x and you need n ≈ 38,000 per arm, closer to a 16-day test.
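The same arithmetic can be sketched in a few lines of Python. This is a minimal version under the assumptions already stated in the text (two equal arms, two-sided test, variance p(1 − p) in both arms); the function names `per_arm_sample_size` and `days_to_fill` are made up for illustration. It uses the exact normal quantiles rather than the rounded constant 16, so it lands slightly below the rule-of-thumb figure.

```python
import math
from statistics import NormalDist


def per_arm_sample_size(p: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm n for a two-sided two-proportion test with equal arms.

    Uses the same approximation as the text: variance p(1 - p) in both arms.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # about 0.84 for power = 0.80
    variance = p * (1 - p)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * variance / mde ** 2)


def days_to_fill(n_per_arm: int, daily_users: int, arms: int = 2) -> int:
    """Whole days needed to fill every arm at the given eligible volume."""
    return math.ceil(arms * n_per_arm / daily_users)


n = per_arm_sample_size(p=0.38, mde=0.015)
print(n, days_to_fill(n, daily_users=5_000))
```

Rounding the constant up to 16, as the rule of thumb does, builds in a small safety margin, which is one reason to prefer it for back-of-envelope planning.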

Now look at the banner test from the opening: 200 users per arm, baseline click-through around 8%. Variance ≈ 0.074. To detect a 1pp lift at 80% power, n ≈ 16 × 0.074 / 0.0001 = 11,840 per arm. They had 200. The test was mathematically incapable of detecting the effect they were hoping for. The p = 0.07 they cited wasn't a near-significant signal; it was random noise on a sample that couldn't have signaled anything.
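To put a number on "mathematically incapable," here is a rough power calculation for the banner test. It is a sketch using the normal approximation with equal arms; `approx_power` is a hypothetical helper name, and the 1pp lift is the same illustrative target used above.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p: float, lift: float, n_per_arm: int,
                 alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test, equal arms."""
    z = NormalDist()
    se = sqrt(2 * p * (1 - p) / n_per_arm)   # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = lift / se                        # true effect in standard-error units
    # Probability the test statistic lands beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


banner = approx_power(p=0.08, lift=0.01, n_per_arm=200)
planned = approx_power(p=0.08, lift=0.01, n_per_arm=11_840)
print(f"power at n = 200: {banner:.1%}, at n = 11,840: {planned:.1%}")
```

Under these assumptions the 200-per-arm test has power in the single digits, barely above the 5% false positive rate, while the properly sized test sits near the 80% target.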

A few practical notes:

  • The 16 in the formula comes from 2 × (z_{α/2} + z_β)² ≈ 15.7 for α = 0.05, power = 0.80. For 90% power use ~21. For α = 0.01 (Bonferroni-corrected for 5 metrics, say) the constant climbs to about 23 at 80% power.
  • For continuous metrics (revenue per user, session length), use the actual sample variance, and beware of heavy tails. Capping or log-transforming the metric is often the right move; do it before you run, not after.
  • Sample size scales with 1/δ². Halving the MDE quadruples the required sample. This is why "let's just run it longer if it doesn't pop" is a fantasy.
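The constants in the first note can be reproduced directly from the normal quantiles. A small sketch, assuming a two-sided test with two equal arms; `sample_size_constant` is an illustrative name.

```python
from statistics import NormalDist


def sample_size_constant(alpha: float, power: float) -> float:
    """The multiplier C in n ≈ C × σ² / δ² for a two-sided, two-arm test."""
    z = NormalDist()
    return 2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


for alpha, power in [(0.05, 0.80), (0.05, 0.90), (0.01, 0.80)]:
    print(alpha, power, round(sample_size_constant(alpha, power), 1))
```

This should print roughly 15.7, 21.0, and 23.4. Since the constant multiplies σ² / δ², any increase in it scales the required sample by the same factor, on top of the 1/δ² effect of a smaller MDE.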

If your sample-size calculator says you need 38,000 users per arm (76,000 total) and the team only has 5,000 eligible users per week, your options are: run for about 16 weeks, accept a larger MDE (and admit you can't detect smaller wins), or pick a different experiment. There is no fourth option where math bends.

Randomization Unit: User vs Session vs Cluster

Picking the wrong randomization unit is the silent killer of A/B tests. You'll get a clean p-value on the wrong question.

User-level randomization is the default for most product experiments. A user is assigned to a variant the first time they hit the experiment, and they stay in that variant forever (or at least for the test window). This is correct when the metric is computed per user: retention, LTV, purchase frequency, 7-day return rate.
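A common way to get sticky user-level assignment without storing state is to hash the user ID together with the experiment name. The sketch below is one possible approach, not a description of any particular platform; `assign_variant` and the experiment key are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    """Deterministically map a user to a variant for one experiment."""
    # Salting with the experiment name keeps assignments independent
    # across experiments that share the same user base.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# The same user always lands in the same arm, on every session.
first_visit = assign_variant("user-123", "checkout-progress-bar")
later_visit = assign_variant("user-123", "checkout-progress-bar")
print(first_visit == later_visit)  # True
```

Hashing a session ID instead of a user ID would turn this into session-level randomization, which is why the choice of key deserves a line in the design doc.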

Session-level randomization assigns each session independently. This works for stateless, single-session metrics like page load time or single-session conversion on a landing page where users don't return. It breaks badly when the metric compounds across sessions. If you randomize a recommendation algorithm at the session level and measure 30-day retention, the same user bounces between both recommendation experiences over those 30 days; you're measuring a blend of A and B, not A vs B.

Cluster randomization is for marketplace, network, and social effects. If the variant changes how supply meets demand (a new ranking algorithm in a marketplace, a feed change that affects what other users see), you cannot randomize individual users. They spill over into each other's experience. Randomize at the geo level, the marketplace level, or the social cluster. The cost is that your effective n drops toward the number of clusters, not the number of users, and your sample size calculation needs to account for within-cluster correlation (which usually makes the estimate far noisier than the user count suggests).
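The drop in effective n can be approximated with the standard design-effect formula, 1 + (m − 1) × ICC, where m is the average cluster size and ICC is the intra-cluster correlation of the metric. The sketch below assumes equal-sized clusters, and the cluster counts and ICC values are invented for illustration; `effective_sample_size` is a hypothetical helper.

```python
def effective_sample_size(n_users: int, avg_cluster_size: float,
                          icc: float) -> float:
    """Effective n under the standard design-effect approximation.

    design effect = 1 + (m - 1) * icc, with m the average cluster size and
    icc the intra-cluster correlation of the metric (0 = independent users,
    1 = every user in a cluster behaves identically).
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return n_users / design_effect


# 100,000 users spread over 50 geos of 2,000 users each (made-up numbers).
for icc in (0.0, 0.05, 1.0):
    print(icc, round(effective_sample_size(100_000, 2_000, icc)))
```

With no correlation the effective n is the full user count; with perfect correlation it collapses to the 50 clusters. Even a modest ICC sits much closer to the second case than intuition suggests.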

The diagnostic question: "If I assigned user A to control and user B to treatment, can user A's outcome be influenced by user B's experience?" If yes, you have interference, and you need cluster randomization or a switchback design.

The session-level mistake from the opening test was a unit mismatch rather than interference. Click-through is a session metric, technically, so session-level randomization passed the sniff test. But the downstream model that retrained on the data needed user-level signal. The randomization unit must match the analysis unit, and both must match the decision unit.

Guardrail Metrics

The primary metric tells you whether the change worked. Guardrails tell you whether it broke something else.

Pre-register two to four guardrail metrics that must not regress beyond a threshold, even if the primary wins. Standard guardrails:

  • Latency (p95 page load, API response time): many "wins" are wins because the new variant loaded faster, not because the change was good.
  • Error rate (5xx, client-side JS errors): a treatment that doubles error rate is shipping a bug, regardless of what conversion does.
  • Revenue per user: if you optimize click-through and revenue per user drops, you found a way to make people click on lower-value things. Don't ship.
  • Support ticket rate: UX changes that confuse users show up here, not in the conversion metric.

The threshold matters. A common pattern: "guardrail must not regress by more than 1% relative, or the experiment fails regardless of primary outcome." Pre-register the threshold. Otherwise, when latency comes in 2% slower, the conversation becomes "is 2% really meaningful," and you're negotiating with yourself.
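A pre-registered threshold is easiest to enforce when the check is mechanical. Here is a minimal sketch of such a check; the `Guardrail` class, the metric values, and the 1% limit are illustrative assumptions. It compares point estimates only, which is a simplification: a stricter version would compare the confidence bound against the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    name: str
    max_relative_regression: float   # e.g. 0.01 for "no worse than 1%"
    higher_is_worse: bool            # True for latency, error rate, tickets


def guardrail_passes(g: Guardrail, control: float, treatment: float) -> bool:
    """Compare the observed relative change against the pre-registered limit."""
    change = (treatment - control) / control
    regression = change if g.higher_is_worse else -change
    return regression <= g.max_relative_regression


latency = Guardrail("p95 page load (ms)", 0.01, higher_is_worse=True)
revenue = Guardrail("revenue per user", 0.01, higher_is_worse=False)

print(guardrail_passes(latency, control=800.0, treatment=816.0))  # False: 2% slower
print(guardrail_passes(revenue, control=4.00, treatment=3.98))    # True: 0.5% lower
```

The value of writing it this way is that the 2% latency case fails without a meeting about whether 2% is meaningful.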

The point of guardrails is to catch the experiment that won the primary but tanked the business. They are the most underused tool in data science work, and the cheapest insurance you can buy.

The Readout Doc

Same shape, every time. The readout doc that survives review is one page, scannable in 90 seconds, with no surprises in the appendix. Here's the template:

  • Hypothesis: one paragraph, the four-part template above, written before the experiment started.
  • Design: randomization unit, sample size target, MDE, primary metric, guardrails, traffic allocation.
  • Dates and sample: start date, end date, actual sample size achieved per arm.
  • Primary result: point estimate, 95% confidence interval, p-value. One line.
  • Guardrails: table of guardrail metrics with delta, CI, and pass/fail vs pre-registered threshold.
  • Pre-registered segment cuts: same metric, broken out by the segments you committed to in advance.
  • Decision: ship / don't ship / iterate, with the rationale tied directly to the result.
  • Rollback plan: if shipped, how do we monitor in production, and what triggers a rollback?
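The "Primary result" line can be produced by a short, reviewable function. The sketch below assumes a binary metric and large samples, uses a Wald interval for the difference and a pooled z-test for the p-value, and the counts in the usage example are hypothetical; `primary_result` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def primary_result(conv_c: int, n_c: int, conv_t: int, n_t: int,
                   alpha: float = 0.05) -> tuple[float, tuple[float, float], float]:
    """Difference in proportions with a Wald CI and a two-sided z-test p-value."""
    z = NormalDist()
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c

    # Unpooled standard error for the confidence interval.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    half_width = z.inv_cdf(1 - alpha / 2) * se

    # Pooled standard error for the test of "no difference".
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    p_value = 2 * (1 - z.cdf(abs(diff) / se_null))

    return diff, (diff - half_width, diff + half_width), p_value


# Hypothetical counts, for illustration only.
diff, (lo, hi), p = primary_result(conv_c=6460, n_c=17000, conv_t=6732, n_t=17000)
print(f"lift: {diff:+.2%} (95% CI {lo:+.2%} to {hi:+.2%}), p = {p:.4f}")
```

Comparing the lower end of the interval to the pre-registered MDE is often more informative for the decision line than the p-value alone.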

What's not in the readout: post-hoc segment cuts presented as findings, narrative reframings of the hypothesis, or "directional" calls. If a segment cut is exploratory, label it exploratory in a clearly marked section. The reviewer should be able to tell at a glance which numbers were planned and which were fishing.

The discipline is the template. When every experiment in the org uses the same one-pager, reviewers stop having to learn each data scientist's personal style and start being able to actually evaluate the work.

Why Most Experiments Fail

After enough readouts, the failure modes cluster into a short list:

  • Under-powered. The MDE math wasn't done, or it was done and ignored. The test could not have detected the effect being claimed.
  • Unclear hypothesis. No falsifiable prediction, no committed primary metric, no MDE. The experiment "succeeds" no matter what the data says.
  • Wrong randomization unit. Session-level for a user-level question, or user-level for a marketplace question with interference.
  • No guardrails. The primary won, the team shipped, latency regressed 8%, and three weeks later someone notices revenue is down.
  • No rollback plan. Code shipped, the experiment was declared done, and nobody monitored production. The change drifts and nobody can attribute the drift back to the launch.
  • Confounded with another release. The experiment ran during a marketing push or a UI refresh that hit both arms. The estimated effect is the experiment plus the confound, and you can't separate them.

Every one of these is preventable in the design phase. None of them is fixable after data collection.

HARKing Avoidance

HARKing (Hypothesizing After the Results are Known) is the most common form of self-deception in experimentation. The pattern: you ran a test on the whole user base, the primary was null, but the variant looks great for "users on iOS in the US who arrived via paid search." So that becomes the headline.

The problem is purely statistical. If you cut your data into 20 segments with no real effect anywhere, you'd expect one of them to hit p < 0.05 by chance alone, and the chance that at least one does is about 64% if the segments are independent. Picking the winner after looking at all 20 and presenting it as a confirmed result is, statistically speaking, manufacturing a result. You'd find the same "effect" on a coin flip if you sliced finely enough.
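The arithmetic behind the 20-segment example is short enough to check by hand or in code. This sketch assumes the segment tests are independent, which real segments rarely are exactly; the function names are illustrative.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


print(f"{familywise_error_rate(20):.1%}")   # 64.2%
print(bonferroni_threshold(20))             # 0.0025
```

Bonferroni is the bluntest correction available, but it makes the cost of each extra segment cut explicit: every cut you add tightens the bar for all of them.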

The fix is pre-registration. Before the experiment starts, write down:

  1. The primary metric.
  2. The exact segment cuts you'll report (e.g., new vs returning users, mobile vs desktop, top-3 markets), and only those.
  3. Any subgroup you commit to as a confirmatory analysis.

Anything you find later goes in a clearly labeled "Exploratory" section, with a note that p-values are not corrected for multiple testing and that the finding needs a follow-up experiment to confirm. Never call a post-hoc subgroup result "significant." Call it a hypothesis for the next test.

The cultural fix is harder than the technical one. When a stakeholder is desperate for a win and the post-hoc cut delivers it, the pressure to launder it as a confirmed result is real. The discipline of pre-registration (writing it down before) is what gives you the standing to push back.

Peeking Discipline

Here's a number that surprises people: if you check your A/B test for significance every day for two weeks and stop the first time p < 0.05, your effective false positive rate is not 5%. Five looks already push it to about 14%, and 14 daily looks put it above 20%. The more often you look, the worse it gets.

The reason is the sequential testing problem. A standard t-test or z-test is calibrated for a single look at the data, after a pre-committed sample is collected. Each additional look is another chance for random noise to cross the threshold. If you peek and stop, you're cherry-picking the most extreme moment in a random walk and reporting it as a fixed result.
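The inflation is easy to see in a simulated A/A test, where there is no true effect and every "significant" result is a false positive. The sketch below makes simplifying assumptions: looks are equally spaced, the metric is approximately normal, and the team stops at the first look where |z| exceeds the fixed-horizon threshold. `peeking_false_positive_rate` is an illustrative name.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(looks: int, sims: int = 20_000,
                                alpha: float = 0.05, seed: int = 0) -> float:
    """Share of A/A tests declared significant at any of `looks` equally
    spaced interim checks, using the unadjusted fixed-horizon threshold."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(sims):
        running_sum = 0.0
        for k in range(1, looks + 1):
            # Each look adds one equal-sized batch of data; under the null the
            # standardized batch effect is N(0, 1).
            running_sum += rng.gauss(0.0, 1.0)
            z_stat = running_sum / sqrt(k)   # z-statistic on all data so far
            if abs(z_stat) > z_crit:
                false_positives += 1
                break                        # the team stops and "ships"
    return false_positives / sims


for looks in (1, 5, 14):
    print(looks, peeking_false_positive_rate(looks))
```

With a single look the rate stays near the nominal 5%. In this setup five looks land around 14% and fourteen daily looks around one in five, with no real effect anywhere in the data.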

You have two clean options:

  1. Commit to the sample size. Calculate n, run the test until you hit n, then look at the result once. No daily dashboards driving stop/ship decisions. Monitoring guardrails for safety is fine; using the primary metric to call the experiment early is not.
  2. Use a sequential testing method. mSPRT (mixture sequential probability ratio test) and other always-valid methods let you check continuously; group sequential designs with alpha-spending functions allow a pre-planned schedule of interim looks; properly implemented Bayesian methods with informative priors can work too, provided the stopping rule is fixed in advance. All of these keep inference valid under peeking, at the cost of a somewhat larger maximum sample than a fixed-horizon test.

What you cannot do is run a fixed-horizon test, peek daily, and stop the moment p crosses 0.05. That is the most common false positive generator in industry experimentation, and it is the reason "shipped wins" routinely fail to replicate when measured properly later.

The fix is procedural. Write the stop rule into the design doc. "We will run for n = 17,000 per arm, expected 8 days, and read out once." If the team can't resist the dashboard, hide the primary metric from the live view and only surface guardrails. The discipline is the design.

Closing

The readout doc that survives review is the one where the design decisions were made before data collection started. The hypothesis was specific. The sample size was calculated. The randomization unit was justified. The guardrails were pre-registered. The segment cuts were committed in advance. The stop rule was written down.

Everything else is storytelling. And storytelling is fine for the narrative section of the readout, but it cannot be the basis of a ship decision.

The fight you win is the one fought before data collection. Spend the hour on the design doc. It's the cheapest hour you'll spend all quarter, and it's the one that decides whether your experiment survives review or quietly joins the pile of "directionally positive" tests that nobody can reconstruct six months later.
