
User Research That Moves the Roadmap, Not Just Confirms Hunches

The PM had already decided. The roadmap doc was three sprints deep, the Jira tickets were drafted, and the engineering manager had a rough estimate. Then the design org said "we should probably talk to users first," and a research project was scheduled.

Five calls. Five customers the CSM hand-picked because they were "engaged and happy to chat." A Notion doc with sixteen quotes, four of them gently supportive, none of them surprising. The PM read the summary, said "this confirms what we thought," and shipped the feature six months later. Adoption capped at 11%. The post-launch retro called it a "messaging problem."

It wasn't a messaging problem. It was a research problem: a decision already made, dressed up as user research. The study didn't fail because the methodology was wrong. It failed because nobody was willing to let it change the decision. That's not research. That's ceremony.

If you've been in this loop and you're tired of it, this is the playbook. It covers when to use which type of research, how to size studies properly, how to write findings that survive a skeptical PM, and how to escape the "users said they want X" trap that quietly kills SMB roadmaps in favor of whoever screamed loudest in the last QBR.

Generative vs evaluative — pick the right tool first

The fastest way to waste a research budget is to pick the wrong category. Generative research answers "what problem are we solving?" Evaluative research answers "does this thing we built solve it?" They look similar from the outside (both involve users, both produce quotes), but they answer questions in opposite directions and require different sample sizes, different recruitment, and different stakeholder buy-in.

Most B2B SaaS teams reach for evaluative when the real question is generative. Someone says "we need to test the new dashboard," and a usability test gets scheduled before anyone has asked whether the dashboard is the right thing to build at all. By the time the test runs, the only answer it can give is narrow: can users find the buttons? The question that mattered (should this dashboard exist?) is now off the table because the prototype is already built.

| Research type | Question it answers | Methods | Sample size | Time | When to use |
|---|---|---|---|---|---|
| Generative | What's the problem? | 1:1 interviews, diary studies, contextual inquiry, jobs-to-be-done | 8-15 per segment | 2-4 weeks | Before scoping, before designs |
| Evaluative (qual) | Does this work? | Moderated usability tests, concept tests | 5-8 per segment | 1-2 weeks | After wireframes, before build |
| Evaluative (quant) | At what rate does this work? | Unmoderated tests, A/B, surveys, analytics | 30-100+ | 1-3 weeks | After build, before scale |
| Continuous | What changed? | Ongoing interviews, support ticket review, NPS verbatims | 4-6 per month | Ongoing | Post-launch, every cycle |

Use the table as a forcing function. Before you spec a study, write down the decision the team is about to make. If the decision is "should we build this?", you need generative. If it's "should we ship the version we built?", you need evaluative. If it's "should we iterate on the version we shipped?", you need continuous. Most "we need user research" requests are actually one of these three, and the requester usually doesn't know which.
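One way to make that forcing function concrete is to write the mapping down. The sketch below is illustrative only: the `Decision` enum and the `research_type_for` helper are made-up names, not part of any tool, and the three entries come straight from the paragraph above.

```python
from enum import Enum


class Decision(Enum):
    """The three decisions the paragraph above distinguishes."""
    BUILD = "Should we build this?"
    SHIP = "Should we ship the version we built?"
    ITERATE = "Should we iterate on the version we shipped?"


# Hypothetical lookup: decision on the table -> research category to reach for.
RESEARCH_FOR_DECISION = {
    Decision.BUILD: "generative",
    Decision.SHIP: "evaluative",
    Decision.ITERATE: "continuous",
}


def research_type_for(decision: Decision) -> str:
    """Return the research category that matches the decision being made."""
    return RESEARCH_FOR_DECISION[decision]


print(research_type_for(Decision.BUILD))
```

If a request doesn't map cleanly onto one of the three decisions, that is usually the signal to go back to the requester before speccing anything.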

The 5-user usability test — Nielsen's actual rule, not the meme

Jakob Nielsen and Thomas Landauer's 1993 paper modeled how many usability problems each additional test user uncovers. Plug in the average detection rate from their data and 5 users surface roughly 85% of the usability issues in a study. That number got compressed into "always test with 5" and has been quoted for thirty years by people who haven't read the paper. The 5-user rule holds under specific conditions, and those conditions are narrower than most product teams assume.

The rule applies when you have one user segment, doing one task, on one interface. A new user signing up. An admin configuring a single setting. An end user filing one expense report. Inside that scope, the math holds: by user 5 you've seen most of the issues, and by user 8 you're seeing repeats.
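The math behind the rule is short enough to run yourself. The sketch below assumes the usual form of the model, where each user independently surfaces a given issue with some fixed probability, and uses 0.31 as that probability because it is the average figure commonly cited from Nielsen's work. Both the function name and the default are assumptions for illustration; your product's real rate may be very different.

```python
def share_of_issues_found(n_users: int, per_user_rate: float = 0.31) -> float:
    """Expected share of issues seen at least once after n_users sessions.

    Assumes every user has the same, independent chance (per_user_rate)
    of running into any given issue: found = 1 - (1 - rate) ** n.
    """
    if n_users < 0:
        raise ValueError("n_users must be non-negative")
    if not 0.0 <= per_user_rate <= 1.0:
        raise ValueError("per_user_rate must be between 0 and 1")
    return 1.0 - (1.0 - per_user_rate) ** n_users


# Print the curve for a few study sizes to see the diminishing returns.
for n in (1, 3, 5, 8, 15):
    print(f"{n:>2} users: {share_of_issues_found(n):.0%} of issues expected")
```

The curve flattens quickly after 5, which is where the "repeats by user 8" intuition comes from. It only flattens that fast when the per-user rate is high, which is the single-segment, single-task case.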

The rule breaks the moment any of those conditions break:

  • Multiple personas. If your B2B product has admins, end users, and IT, you need 5 from each. That's 15 sessions, not 5. Admins get confused by things end users never see.
  • Conceptual issues, not interactional ones. The 5-user rule catches "I couldn't find the button." It misses "I don't understand why this feature exists." Conceptual misunderstanding shows up at low rates per user but matters more — you need 12-15 to spot it reliably.
  • Branching workflows. A linear signup flow tests fine with 5. A workflow with 6 conditional branches needs sample coverage on each branch. You can run 5 users and watch 4 of them never hit the branch where the bug lives.
  • Cross-segment comparison. If the question is "does this work for SMB and enterprise?", you need a study sized for a segment-level comparison, which is 8-12 per segment minimum.

| Scenario | Recommended n | Why |
|---|---|---|
| Single persona, single task, polished prototype | 5 | Classic Nielsen; holds well |
| Two personas (admin + end user) | 8-10 (4-5 per persona) | Personas surface different issues |
| Three personas (admin, end user, IT) | 12-15 | Diminishing returns, but coverage matters |
| Conceptual comprehension test | 12-15 | Conceptual issues hit at lower rates |
| Branching workflow with 4+ paths | 12-20 | Need coverage on each branch |
| SMB vs enterprise comparison | 16-24 (8-12 per tier) | Segment-level claims need segment-level n |
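The larger numbers in the table fall out of the same arithmetic once the per-user hit rate drops. The sketch below is a rough planning aid, not a substitute for judgment: it assumes users are independent, and the 0.15 rate for a conceptual issue and the 0.20 share of users reaching a given branch are illustrative guesses, not measured values.

```python
import math


def chance_seen_at_least_once(n_users: int, per_user_rate: float) -> float:
    """Probability that at least one of n_users runs into the issue or branch."""
    return 1.0 - (1.0 - per_user_rate) ** n_users


def sessions_needed(per_user_rate: float, target: float = 0.85) -> int:
    """Smallest n whose chance of seeing the issue at least once meets target."""
    if not 0.0 < per_user_rate < 1.0:
        raise ValueError("per_user_rate must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_user_rate))


# Assumed rate: a conceptual misunderstanding that trips up 15% of users.
print("sessions for a 15% issue:", sessions_needed(0.15))

# Assumed rate: a branch that only 20% of users ever reach.
print(f"chance 5 users reach it: {chance_seen_at_least_once(5, 0.20):.0%}")
```

Swap in your own estimates. The point is that "5 is enough" is a claim about a high hit rate, and the rarer the thing you're looking for, the faster the required n climbs.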

When a stakeholder says "let's just do 5," ask which persona, which task, and what decision the result will inform. If the decision is "ship or don't ship across our whole user base," 5 isn't enough. If the decision is "is this signup flow broken for new SMB users specifically," 5 might be plenty.

Unmoderated testing — Maze, UserTesting, Lyssna, and what they hide

Unmoderated platforms have changed how fast a designer can run an evaluative study. Maze, UserTesting, and Lyssna let you ship a prototype to 50 testers and have results by Friday. Speed is real. So is the cost.

Unmoderated testing wins on three things: speed (24-72 hour turnaround), reach (you can recruit panels you'd never get on a moderated call), and quantitative comparison (A/B two designs at scale). For clear, low-context tasks aimed at a broad audience, it's hard to beat.

It loses on three things, and B2B products run into all three:

  1. Complex workflows. Moderated studies let you watch a user think out loud, ask "why did you click that?", and probe when they get stuck. Unmoderated, you get a video of someone clicking the wrong thing in silence and moving on. You learn that they failed. You don't learn why.
  2. Jargon-heavy interfaces. B2B products are full of terms users only understand in context. An unmoderated tester from a panel will guess, fail, and rate the test "easy" because they don't know what they don't know. The test produces clean-looking data and silent comprehension gaps.
  3. The "why" question. Anything that requires understanding intent, motivation, or trade-off reasoning needs a moderator. Unmoderated tools have improved at follow-up prompts, but a recorded follow-up question gets a rehearsed answer. A live moderator gets the real one.

Realistic completion rates back this up. For consumer-facing tasks, unmoderated B2C tests run 80-95% completion. For B2B SaaS workflows, expect 60-75%. That gap matters when sizing: a study that needs n=20 valid sessions on a B2B workflow needs roughly 27 to 34 starts to hit it, depending on where in that range completion lands. Plan for the dropout.
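The sizing arithmetic is a one-liner, sketched here with a hypothetical helper name. Taking the 60-75% range at face value and rounding up, the strict bounds come out to 27 and 34 starts, which brackets a planning figure of around 30.

```python
import math


def starts_needed(valid_sessions: int, completion_rate: float) -> int:
    """Starts to recruit so that valid_sessions are expected to complete."""
    if valid_sessions < 0:
        raise ValueError("valid_sessions must be non-negative")
    if not 0.0 < completion_rate <= 1.0:
        raise ValueError("completion_rate must be in (0, 1]")
    # Round up: a fractional participant still costs a whole recruit.
    return math.ceil(valid_sessions / completion_rate)


target = 20
best_case = starts_needed(target, 0.75)   # optimistic end of the B2B range
worst_case = starts_needed(target, 0.60)  # pessimistic end of the B2B range
print(f"Recruit between {best_case} and {worst_case} starts for n={target}")
```

Budget for the pessimistic end if the workflow is long or jargon-heavy. Over-recruiting by a few sessions is cheaper than rerunning the study.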

| Decision being made | Better fit |
|---|---|
| Does this signup flow work for new SMB users? | Unmoderated (clear task, broad reach) |
| Why aren't enterprise admins adopting the new permissions UI? | Moderated (need "why," not "did") |
| Which of these two pricing pages converts better? | Unmoderated A/B at scale |
| How do power users actually use the bulk actions feature? | Moderated or contextual inquiry |
| Quick comprehension check on new microcopy | Unmoderated, n=20-30 |
| Cross-team workflow that touches admin, manager, and IC | Moderated, all three personas |

The trap most teams fall into: they pick unmoderated because it's fast, then make decisions that needed moderated depth. Maze tells you the click-through rate. It doesn't tell you that 4 of 12 SMB users had no idea what "tenant" meant in your IT settings panel and just guessed at the right answer.

The "research showed users want X" trap

This is where most B2B research dies. Selection bias, recency bias, and confirmation bias compound, and a study with eight participants becomes a roadmap built for an audience you don't have.

Selection bias comes from who answers the call. Customer success picks the engaged customers because they answer email. Engaged customers are more likely to be enterprise, more likely to be admins, and more likely to want features that make their existing workflow more powerful. Sample 8 of them and you'll hear a unified message: more permissions, more roles, more enterprise-grade controls. If your business is 70% SMB by ARR, you just ran a study aimed at the 30% who didn't need help anyway.

Recency bias comes from the last loud customer. A QBR happened last week, the VP heard from a $400K account, and "users want SSO" entered the planning doc as a quote with no n attached. By the time research is recruited, the question being asked isn't "do users want SSO?" — it's "how badly do users want SSO?", and the recruitment screens for users who'd find SSO valuable.

Confirmation bias is in the questions themselves. "Would it be useful if you could bulk-export reports?" gets a yes from almost everyone. "How do you currently handle reporting?" tells you whether bulk export is a real bottleneck or a nice-to-have buried under the ten things they actually struggle with daily.

A real example, anonymized: a study of 8 admins from 3 enterprise accounts produced the headline "users want SSO." The roadmap shifted to a quarter of SSO and SCIM work. Six months later, SMB churn ticked up because the team that owned activation had been pulled onto the SSO project. The 8 admins were happy. The 1,400 SMB accounts that never made it past week 2 of activation didn't get talked to. The study wasn't wrong about the 8 admins. The mistake was treating it as a study about "users."

The defense is procedural. Before recruiting, write down: which segment is this study about, what proportion of revenue does that segment represent, and what decisions can this study legitimately inform? If the answer to the third question is "only decisions about this segment," put that on the cover slide of the readout. Stakeholders will overgeneralize unless you make the scope explicit.
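The three questions can live in a tiny record that you fill in before recruiting and paste onto the cover slide afterward. This is a minimal sketch: `StudyScope`, its fields, and the example values are all hypothetical, with the 30% figure borrowed from the SMB-heavy business described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyScope:
    """Answers to the three pre-recruitment questions (hypothetical structure)."""
    segment: str                      # who the study is about
    revenue_share: float              # that segment's share of ARR, 0 to 1
    decisions_in_scope: tuple[str, ...]  # decisions the study can inform

    def __post_init__(self) -> None:
        if not 0.0 <= self.revenue_share <= 1.0:
            raise ValueError("revenue_share must be between 0 and 1")
        if not self.decisions_in_scope:
            raise ValueError("name at least one decision the study informs")

    def cover_line(self) -> str:
        """One-line scope statement for the first slide of the readout."""
        decisions = "; ".join(self.decisions_in_scope)
        return (f"Scope: {self.segment} ({self.revenue_share:.0%} of ARR). "
                f"Informs: {decisions}.")


# Illustrative values only.
scope = StudyScope(
    segment="enterprise admins",
    revenue_share=0.30,
    decisions_in_scope=("SSO and SCIM priorities for enterprise accounts",),
)
print(scope.cover_line())
```

Refusing to construct a scope with no named decision is deliberate. If you can't say what the study informs, it isn't ready to recruit for.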

How to write a finding that changes a decision

Most research findings die on the slide they're written on. "Users were confused by the export button" loses every argument because it has no n, no segment, no specificity, and no recommendation. It can be true and still get ignored.

A finding that survives a skeptical PM has five parts:

  1. Observation: what happened, behaviorally
  2. Evidence: n, segment, task, study type
  3. Inference: what this likely means
  4. Recommendation: what to do about it
  5. Confidence level: how sure you are

Compare:

Bad: Users were confused by the export button.

versus:

Good: 6 of 8 admins (n=8, enterprise tier, moderated usability test, weekly-report export task) abandoned the export workflow at the format-selection step. Three said out loud they didn't know what "delimited" meant; two clicked the wrong format and didn't notice. Inference: the format selector is a comprehension barrier, not a discoverability one. Recommendation: default to CSV with a "more formats" disclosure, ship behind a flag and measure abandonment delta. Confidence: medium. Sample is small and enterprise-only; a 2-week unmoderated follow-up across SMB would confirm.

The second one wins because it tells the PM exactly what's known, what's inferred, and what action follows. The skeptical PM can attack any of the five parts, but they have to attack a specific part. They can't just say "small sample" and walk away, because the recommendation already accounts for it with a follow-up plan.
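If it helps to enforce the five-part shape, a template that refuses to call a finding "ready" until every part is filled in does the job. The sketch below is one possible structure, not a standard. The `Finding` class and its method names are invented for illustration, and the field contents are abbreviated from the export example above.

```python
from dataclasses import dataclass, fields


@dataclass
class Finding:
    """The five parts of a finding, in the order listed above."""
    observation: str = ""
    evidence: str = ""
    inference: str = ""
    recommendation: str = ""
    confidence: str = ""

    def missing_parts(self) -> list[str]:
        """Names of any parts that are empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_ready(self) -> bool:
        """A finding is ready for the room only when all five parts exist."""
        return not self.missing_parts()


bad = Finding(observation="Users were confused by the export button.")
good = Finding(
    observation="6 of 8 admins abandoned export at the format-selection step.",
    evidence="n=8, enterprise tier, moderated usability test, weekly-report export task.",
    inference="The format selector is a comprehension barrier, not a discoverability one.",
    recommendation="Default to CSV with a 'more formats' disclosure; ship behind a flag.",
    confidence="Medium: small, enterprise-only sample; SMB follow-up would confirm.",
)

print("bad is missing:", bad.missing_parts())
print("good is ready:", good.is_ready())
```

The check is crude on purpose. It can't tell a strong inference from a weak one, but it does catch the most common failure: an observation presented as if it were a finding.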

A second pattern that helps: lead findings with the decision they affect, not the methodology. "We need to change the export default" lands. "We ran a moderated study with 8 admins" puts the audience to sleep before the recommendation arrives.

Presenting research to a skeptical PM

A 23-slide research deck handed to a PM in standup will not change a roadmap. It will get acknowledged, filed, and ignored. PMs are decision-makers under time pressure. Research has to meet them where they are.

Five things that actually work:

Lead with the decision. Open with "this study informs whether we ship the new export flow as-is, ship with one change, or rebuild." Then the methodology, then the findings. The PM is now reading to make a decision, not reading to evaluate research.

Pre-mortem the pushback. Before you present, write down the three objections you expect ("small sample," "those aren't our ICP," "we already decided"). Address each in the deck before it gets raised. Saying "n=8 is small for cross-segment claims, which is why this study only speaks to enterprise admins doing the export task" disarms the small-sample attack before it lands.

Bring the raw clip, not the summary. A 45-second video of an admin staring at the format dropdown and saying "I have no idea what any of these mean" is worth fifteen quote slides. PMs trust their own eyes more than your synthesis. Tools like Dovetail and UserTesting make pulling clips fast.

Anchor to a metric the PM already cares about. If the PM owns activation, frame findings around activation impact. If they own retention, frame around retention. "This export friction touches week-2 activation for new admins" beats "this export friction is a UX issue" every time.

Name the cost of ignoring it. "If we ship as-is, we expect roughly 30-40% of new admins to abandon the export task in the first month, based on the moderated session abandonment rate" gives the PM something to weigh against ship date.
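A rate taken from a handful of moderated sessions carries wide uncertainty, and it is worth knowing how wide before quoting a range to a PM. The sketch below uses the Wilson score interval, a standard choice for small-sample proportions. The counts (3 abandonments in 8 sessions) are hypothetical, picked only to land near the 30-40% figure above, and the session-level rate is assumed to carry over to the wider population, which is itself a judgment call.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 3 of 8 moderated sessions ended in abandonment.
low, high = wilson_interval(3, 8)
print(f"Observed 3/8 = {3 / 8:.0%}; plausible range {low:.0%} to {high:.0%}")
```

An interval that wide is not a reason to stay quiet. It is a reason to pair the estimate with the follow-up that would narrow it, which is what the confidence line in a finding is for.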

A practical rule: if a research finding can't fit on a single slide with the decision, evidence, recommendation, and one quote, it's not a finding yet. It's a notebook entry. Keep working on it before it goes to the room.

What to do this week

Pick one upcoming roadmap decision. Write down what's being decided, who's deciding it, and what would have to be true for the decision to flip. Then ask: does the team have evidence for the thing that would have to be true? If not, that's the study. If yes, the research has already been done and the next step is to surface it.

Research that moves the roadmap is research aimed at a specific decision, sized for the claim it needs to support, and presented in a form a busy PM can act on. Everything else is a workshop. Workshops are fine. Just don't confuse them with studies.
