How to Run a Usability Test: The 8-Step Process

Most product teams agree usability testing is the highest-leverage research activity they can run. A 30-minute session watching a real user struggle through your checkout flow is worth more than a week of internal debate about whether the checkout is confusing. Despite that, usability testing is the research method teams most often skip.

The reason isn’t disagreement about its value. It’s that the standard process feels heavyweight relative to the pace of product work. Recruit eight participants. Coordinate calendars. Moderate sessions live. Transcribe. Synthesize. Build a deck. Three weeks of elapsed time for findings the team needed before the last sprint planning meeting, not after.

This post walks through the eight-step usability testing process as it’s traditionally practiced, then shows where modern AI-moderated research compresses the production overhead so the team can actually run the tests they intended to run.

Step 1: Define the research question and the decision

The first step is the one most teams under-invest in. Before recruiting a single participant, write down two things:

  • The research question. What do you actually want to know? “Is the new checkout flow usable” is too vague — usable for whom, doing what, under what conditions? Better: “Can a first-time user complete signup, payment, and first booking within 5 minutes, on mobile, without help?”
  • The decision the test will inform. What will the team do differently based on what the study finds? If the answer is “we’ll think about it,” the study isn’t ready to run. Tests that inform a real upcoming decision — ship vs hold, A vs B, scope-in vs scope-out — earn their elapsed time. Tests that don’t, won’t.

This step is research craft. It can’t be shortcut by tooling. The good news is it takes 30 minutes of clear thinking, not weeks.

Step 2: Identify the user segment and recruit

Recruitment is the largest determinant of usability-test quality. The most rigorously designed study fails if the wrong people show up.

Three recruitment paths:

  1. Import your own customer list. Best for evaluating existing flows with current users. CRM integration (Salesforce, HubSpot) lets you target by segment, lifecycle stage, or product usage. Downside: existing users are often too forgiving of familiar friction, and you can’t recruit for first-time experience.
  2. Use a built-in research panel. Most modern remote-testing platforms include a vetted panel of pre-screened participants. Fastest path: recruitment goes from weeks to hours. Panel quality varies — the strongest run multi-layer fraud prevention and active quality scoring.
  3. Specialist recruitment agency. Necessary for hard-to-reach segments like enterprise IT buyers, licensed clinicians, or regulated populations. Costs run 5-10x panel recruitment and timelines stretch to weeks, so treat it as a last resort.

Screen for the dimensions that actually matter: role/seniority, prior product familiarity, device type, geographic market. Device matters more than most teams plan for — a desktop study with 40% mobile participants produces muddled findings.
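The screening dimensions above can be written down as an explicit filter before recruitment starts. The sketch below is illustrative only: the `Candidate` and `Screener` types, their field names, and the sample people are all hypothetical, not any platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record for one person who answered the screener.
    name: str
    role: str
    used_product_before: bool
    device: str  # "mobile" or "desktop"
    market: str


@dataclass(frozen=True)
class Screener:
    # Hypothetical screening rules covering the four dimensions in the text.
    roles: frozenset
    require_first_time: bool
    device: str
    markets: frozenset

    def qualifies(self, c: Candidate) -> bool:
        """Return True only if the candidate matches every dimension."""
        return (
            c.role in self.roles
            and (not self.require_first_time or not c.used_product_before)
            and c.device == self.device
            and c.market in self.markets
        )


screener = Screener(
    roles=frozenset({"office manager", "operations lead"}),
    require_first_time=True,
    device="mobile",
    markets=frozenset({"US", "CA"}),
)

candidates = [
    Candidate("Ana", "office manager", False, "mobile", "US"),
    Candidate("Ben", "office manager", True, "mobile", "US"),       # already a user
    Candidate("Chloe", "operations lead", False, "desktop", "CA"),  # wrong device
    Candidate("Dev", "student", False, "mobile", "US"),             # wrong role
    Candidate("Eli", "operations lead", False, "mobile", "CA"),
]

qualified = [c.name for c in candidates if screener.qualifies(c)]
print(qualified)
```

Making device a hard requirement rather than a soft preference is one way to avoid the mixed desktop and mobile sample described above.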

Step 3: Design tasks and success criteria

Tasks should be realistic, scenario-grounded, and unambiguous. Three rules of thumb:

  • Scenario before instruction. “You’re booking a flight for a work trip next week and your manager has asked you to keep it under $500” works better than “find a flight under $500.” The scenario gives the participant context for their decisions; bare instructions produce robotic behavior that doesn’t predict real use.
  • Define success criteria up front. What does completion look like? What signals failure? Pre-commit to the criteria before running sessions, so post-hoc rationalization doesn’t reframe a failed task as a success.
  • Sequence from broad to narrow. Start with open-ended exploration (“show me how you’d plan a trip”), then move to specific tasks (“now book the flight from your search results”). Open-ended first surfaces the natural workflow; specific-first locks participants into your hypothesis.

Four to six tasks is the sweet spot for a 30-45 minute session. More than six and participants fatigue; fewer than four and you don’t surface enough signal.
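One way to pre-commit to success criteria is to record each task as structured data and check the plan before any session runs. This is a minimal sketch under assumptions: the `Task` fields, the `check_plan` and `passed` helpers, and the time limits are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    scenario: str           # context given to the participant first
    instruction: str        # what they are asked to do
    success_criterion: str  # written down before sessions start
    max_seconds: int        # time limit that counts as success


def check_plan(tasks, min_tasks=4, max_tasks=6):
    """Return a list of problems with the study plan (empty if none)."""
    problems = []
    if not (min_tasks <= len(tasks) <= max_tasks):
        problems.append(
            f"plan has {len(tasks)} tasks; aim for {min_tasks}-{max_tasks}"
        )
    for i, task in enumerate(tasks, start=1):
        if not task.scenario.strip():
            problems.append(f"task {i} has no scenario")
        if not task.success_criterion.strip():
            problems.append(f"task {i} has no success criterion")
    return problems


def passed(task, completed, seconds):
    """Apply the pre-committed criterion: completed within the time limit."""
    return completed and seconds <= task.max_seconds


plan = [
    Task(
        scenario="You're booking a flight for a work trip next week and "
                 "your manager asked you to keep it under $500.",
        instruction="Find and book a suitable flight.",
        success_criterion="Booking confirmation reached with fare under $500.",
        max_seconds=300,
    ),
    Task(
        scenario="Your meeting moved by a day.",
        instruction="Change the flight date.",
        success_criterion="New date shown on the itinerary.",
        max_seconds=180,
    ),
    Task(
        scenario="You need a receipt for expenses.",
        instruction="Download the receipt.",
        success_criterion="",  # deliberately missing to show the check
        max_seconds=120,
    ),
]

problems = check_plan(plan)
print(problems)
```

Because the criteria live in the plan rather than in the facilitator's head, a session result can be scored with `passed` without reinterpreting what success meant after the fact.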

Step 4: Choose methodology

Three axes determine the methodology:

  • Moderated vs unmoderated. Moderated sessions have a live facilitator who probes hesitation; unmoderated sessions are async with the platform recording solo participants. Moderated gives diagnostic depth; unmoderated gives scale.
  • Remote vs in-person. Remote is now the default. In-person testing makes sense for physical products, on-site enterprise workflows, or when observation of body language matters more than what the participant says.
  • Behavior-only vs behavior + reasoning. Behavior-only tools capture click paths and time-on-task without explanation. Behavior + reasoning tools layer think-aloud narration on top so you hear what the participant was trying to do when something went wrong.

For most product teams running discovery on a digital flow, the answer is remote, moderated (or AI-moderated), with reasoning capture. Diagnostic depth is what makes usability data worth the elapsed time.

Step 5: Pilot with 1-2 participants and refine

Before running a full study, pilot with one or two participants. Two-thirds of the issues you find won’t be issues with the product — they’ll be issues with the study design.

Common pilot findings:

  • Task instructions that confused the participant in ways you didn’t anticipate
  • Screener questions that filtered out a qualified participant (or let an unqualified one through)
  • Time estimates that were off by 50% or more
  • Probes that felt leading or that participants answered too quickly to be useful

Refine, then run the rest. Skipping the pilot is the most common failure mode in usability testing — a 30-minute pilot prevents a week of unusable data.

Step 6: Run the study

The actual session-running phase is where elapsed-time costs accumulate. For traditional moderated remote testing, a senior facilitator can run 4-6 sessions per day before fatigue dulls probing quality, so eight sessions look like two days on paper. Time-zone spread, no-shows, and reschedules stretch that to a week or more of facilitator calendar, inside a study that commonly takes three weeks end to end.

The mechanics of the session itself:

  • Open with a 2-3 minute warm-up to put the participant at ease and confirm the recording is working
  • Present the scenario, then the first task. Resist the urge to clarify — ambiguity is data.
  • Probe hesitation. “What are you thinking?” “What did you expect to happen?” “Why did that label feel confusing?”
  • Don’t lead. “Did you notice the menu at the top?” is a leading question that contaminates the rest of the session.
  • Close with retrospective questions: “If you were redesigning this, what would you change first?”

Recording quality matters. A muffled session is a wasted hour for everyone. Confirm audio and screen capture before starting every session.

Step 7: Synthesize findings

Synthesis is the second-largest time sink after recruitment. Traditional workflow: rewatch every session, take notes, organize notes into themes, pull representative quotes, count behavioral signals across sessions.

The synthesis output the team actually needs:

  • Top 3-5 themes ranked by severity and frequency
  • Verbatim quotes that illustrate each theme — quotes carry persuasive weight that paraphrase doesn’t
  • Behavioral signals — completion rates, time-on-task by step, count of participants who hit each friction point
  • Recommendations mapped to specific design decisions, not generic “improve clarity”
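The behavioral signals in that list are simple to compute once each session is logged in a consistent shape. The sketch below assumes a hypothetical per-session record (`completed`, `seconds`, and a list of `friction` points); the data is invented for illustration.

```python
from collections import Counter
from statistics import median

# Hypothetical session log: one dict per participant.
sessions = [
    {"participant": "P1", "completed": True, "seconds": 212,
     "friction": ["discount field"]},
    {"participant": "P2", "completed": True, "seconds": 180,
     "friction": []},
    {"participant": "P3", "completed": False, "seconds": 400,
     "friction": ["discount field", "date picker"]},
    {"participant": "P4", "completed": True, "seconds": 260,
     "friction": ["discount field"]},
    {"participant": "P5", "completed": True, "seconds": 198,
     "friction": ["date picker"]},
]


def summarize(sessions):
    """Completion rate, median time for completers, and friction counts."""
    completed = [s for s in sessions if s["completed"]]
    # set() so a participant who hit the same point twice counts once.
    friction = Counter(p for s in sessions for p in set(s["friction"]))
    return {
        "completion_rate": len(completed) / len(sessions),
        "median_seconds_completed": (
            median(s["seconds"] for s in completed) if completed else None
        ),
        "friction_counts": friction.most_common(),
    }


summary = summarize(sessions)
print(summary)
```

Counting participants per friction point, rather than raw occurrences, matches the "count of participants who hit each friction point" signal and keeps one struggling participant from dominating the ranking.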

Searchable transcripts are the difference between this taking days and taking hours. If you can grep for “I’m not sure where,” you can find every hesitation moment across 50 sessions in seconds. If you have to scrub through video, you spend the week.
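As a rough illustration of the grep idea, the sketch below searches a set of transcripts for a hesitation phrase. It assumes transcripts are already available as plain text, stored here in a hypothetical dict of session ID to utterances. It also normalizes curly apostrophes, since transcription tools may emit either form and a literal search would otherwise miss matches.

```python
def normalize(text):
    """Lowercase and convert curly single quotes to straight ones."""
    return text.replace("\u2019", "'").replace("\u2018", "'").lower()


def find_hesitations(transcripts, phrase):
    """Return (session_id, line_number, line) for every line with the phrase."""
    needle = normalize(phrase)
    hits = []
    for session_id, lines in transcripts.items():
        for line_no, line in enumerate(lines, start=1):
            if needle in normalize(line):
                hits.append((session_id, line_no, line))
    return hits


# Hypothetical transcripts: session ID -> participant utterances in order.
transcripts = {
    "P1": [
        "Okay, I see the cart.",
        "I\u2019m not sure where the promo code goes.",  # curly apostrophe
    ],
    "P2": ["That was easy."],
    "P3": ["Hmm, I'm not sure where to click next."],
}

hits = find_hesitations(transcripts, "I'm not sure where")
for session_id, line_no, line in hits:
    print(session_id, line_no, line)
```

Each hit carries the session ID and line number, so a quote can be traced back to its moment in the recording instead of being found by scrubbing through video.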

Step 8: Present and route to product decisions

Findings that don’t route to a decision are entertainment. The handoff from research to product is where most usability studies break — synthesized findings sit in a Notion doc the PM glances at once and never returns to.

Three patterns that work:

  • Tie every finding to a specific upcoming decision the product team is making. “Ship the new checkout as-is, or hold for revisions” is a decision a finding can inform. “Make the product better” is not.
  • Quantify the impact where possible. “Six of eight participants couldn’t find the discount code field, increasing completion time by 40%” is decision-ready. “Some participants had trouble with the discount code” is not.
  • Embed in the workflow. Drop findings into the design tool (Figma comments on the relevant frames), the sprint backlog (linked tickets), or the PRD — wherever the team is making the next decision. Don’t make the team come to your research doc.
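The "quantify the impact" pattern comes down to two numbers: how many participants hit the issue, and how much longer they took. A minimal sketch, assuming completion times in seconds split by whether the participant hit the issue; the `finding_summary` helper and the times are hypothetical.

```python
from statistics import median


def finding_summary(times_with_issue, times_without_issue, issue):
    """Format a decision-ready finding from two groups of completion times."""
    hit = len(times_with_issue)
    total = hit + len(times_without_issue)
    baseline = median(times_without_issue)
    # Relative increase in median time for participants who hit the issue.
    increase = (median(times_with_issue) - baseline) / baseline
    return (
        f"{hit} of {total} participants {issue}, "
        f"increasing median completion time by {increase:.0%}"
    )


# Invented completion times in seconds.
struggled = [270, 290, 280, 300, 260, 280]  # could not find the field
found_it = [190, 210]                       # found it without trouble

sentence = finding_summary(
    struggled, found_it, "couldn't find the discount code field"
)
print(sentence)
```

With only two participants in the comparison group, the percentage is a rough indicator rather than a precise estimate, so it is worth stating the counts alongside it, as the sentence does.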

How AI moderation collapses the heavyweight process

Steps 2, 4, 6, and 7 — recruit, methodology, run, synthesize — are the production work that makes traditional usability testing feel heavyweight. Steps 1, 3, 5, and 8 are research craft and don’t compress; they require human judgment about the question, the tasks, the pilot findings, and the decisions the team is making.

AI moderation is the lever on the production half:

  • Recruit collapses from weeks to hours when sessions run against a built-in vetted panel.
  • Methodology decision becomes moot when one mode captures both behavioral signal and reasoning at scale.
  • Run collapses from weeks to a day or two when sessions execute asynchronously and in parallel rather than against facilitator calendar.
  • Synthesis collapses from days to hours when transcripts are searchable and themes surface as sessions complete.

The team still does the research thinking. The platform handles the production overhead that’s been making teams skip usability testing entirely.

How does User Intuition run usability tests?

User Intuition runs AI-moderated usability tests as interactive walkthroughs — participants navigate a Figma prototype or live URL on their own device while an AI moderator asks follow-up questions in real time. When a participant hesitates, takes an unexpected path, or expresses frustration, the AI moderator probes with the same conversational follow-ups a skilled human facilitator would ask: what were you trying to do, what did you expect to happen, why did that label feel confusing.

Each session captures the behavioral signal of an unmoderated test (click paths, hesitation patterns, completion rates) and the reasoning depth of a moderated test (verbatim explanations, mental-model gaps, friction sources) in a single recording. Studies recruit from a 4M+ vetted global panel across 50+ languages, with results in 24-48 hours starting at $200 per study. Synthesis surfaces themes, quotes, and decision-ready findings as sessions complete — no rewatching video, no week-long write-up phase.

The methodology decisions that used to require dedicated UX research operations — screener generation, panel recruitment, session moderation, transcript synthesis, findings packaging — are handled by the platform. Product teams focus on the parts of usability testing that don’t compress: defining the research question, designing the tasks, and routing findings to product decisions.

See the usability testing platform overview for the full capability, or the user research solutions page for use-case framing.

Bottom line

The eight-step usability testing process isn’t broken — every step earns its place. The problem is the elapsed time that the production half (recruit, run, synthesize) adds on top of the research half (define, design, decide).

For most product teams in 2026, the practical decision isn’t whether to follow the eight-step process. It’s whether to run it the traditional way (three weeks per cycle, 5-8 participants per round) or to compress the production half with AI moderation (24-48 hours per cycle, 50-100 participants per round). The research craft is the same. The production overhead isn’t.

Start small: pick one upcoming product decision, write the research question, design four tasks, and run a small 10-session study. You’ll learn more about whether AI-moderated usability testing fits your team in 48 hours than you will in another month of debating the question.

Note from the User Intuition Team

Human moderation, done well, is the gold standard. A skilled moderator reads silence, follows a half-thought, knows when to push and when to wait. The trouble is what that costs at scale: one moderator, one participant, one hour at a time — and by interview a hundred, even the best aren't asking the same questions they asked at interview one.

User Intuition keeps what makes great moderation great — the depth, the laddering, the patient probing — and removes what holds it back. The AI moderator ladders 5–7 levels deep on every interview, with no fatigue wall and no calendar to manage. It runs hundreds of conversations in parallel, so a study fills in hours instead of weeks. Setup takes five minutes: upload your study guide and we turn it into a plan, write the screener, recruit from our 4M+ panel, and launch. Every interview is automatically scored on Length, Depth, and Coverage; if it doesn't pass, you don't pay. No refund required.

Preview a real study output before you pay — the only platform in the industry that lets you evaluate the work first. A 10-interview study lands at $200 in 24–48 hours. Already convinced? Sign up and try with 3 free quality interviews.

Frequently Asked Questions

What are the steps in the usability testing process?

The standard usability testing process has eight steps: (1) define the research question and the decision it will inform, (2) identify the user segment and recruit, (3) design tasks and success criteria, (4) choose methodology (moderated vs unmoderated, remote vs in-person, behavior-only vs behavior + reasoning), (5) pilot with 1-2 participants and refine, (6) run the study, (7) synthesize findings into themes, quotes, and behavioral signal, and (8) present and route to product decisions. Steps 1, 3, 5, and 8 are research craft and can't be shortcut. Steps 2, 4, 6, and 7 are production work, the part where AI moderation collapses weeks of effort into hours.

How many participants does a usability test need?

For diagnostic discovery on a single segment, 5-8 participants is usually enough. Jakob Nielsen's classic finding is that about five participants surface roughly 85% of usability issues, and the pattern still holds in remote contexts. For quantitative usability metrics like SUS scores, completion rates, or comparing design variants, 30+ participants per segment is the typical floor; below it, confidence intervals overlap too much to support claims. AI moderation makes 50-100 sessions per study economical, where budget historically capped moderated testing at 5-8.

How long does a usability test take?

Traditional moderated usability tests take 2-3 weeks of elapsed time: 3-5 days to recruit, 5-7 days to schedule and run sessions across participant availability, 3-5 days to transcribe and synthesize, and 2-3 days to package findings. The recruit-and-schedule phase is the largest bottleneck for live moderated work. AI-moderated remote testing on a built-in panel collapses this to 24-48 hours: recruitment runs against an existing vetted panel, sessions run asynchronously and in parallel, and synthesis happens against searchable transcripts as sessions complete.

What is the difference between moderated and unmoderated usability testing?

Moderated usability testing has a live facilitator on the call who probes hesitation, asks follow-up questions, and adapts to what the participant says. It produces deep diagnostic findings but caps at 5-8 sessions per round because of facilitator throughput. Unmoderated testing is asynchronous: the participant completes tasks alone while the platform records, so it scales to 50-200 sessions but loses the real-time probing that makes usability data diagnostic. AI moderation collapses the tradeoff: it scales like unmoderated and probes like moderated.

How does User Intuition run usability tests?

User Intuition runs AI-moderated usability sessions on Figma prototypes or live URLs. Participants complete tasks asynchronously on their own devices while an AI moderator asks follow-up questions in real time when they hesitate, take an unexpected path, or express frustration. Each session delivers behavioral signal and reasoning in one recording; synthesis surfaces themes, quotes, and decision-ready findings as sessions complete. Studies recruit from a 4M+ vetted global panel across 50+ languages, with results in 24-48 hours starting at $200 per study.
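The participant-count guidance in the FAQ above traces back to a simple discovery model: if each participant independently reveals a given problem with probability p, then n participants reveal it with probability 1 - (1 - p)^n. The sketch below assumes the commonly cited average of p = 0.31 from Nielsen and Landauer's data. Real detection rates vary by product and task, so treat the output as a planning heuristic and not a guarantee.

```python
def share_of_problems_found(n_participants, p_detect=0.31):
    """Expected share of problems seen at least once across n participants.

    p_detect is an assumed average chance that a single participant
    reveals a given problem (0.31 is the commonly cited figure).
    """
    return 1 - (1 - p_detect) ** n_participants


for n in (1, 3, 5, 8, 15):
    print(n, round(share_of_problems_found(n), 2))
```

At five participants the model gives about 0.84, which is where the "roughly 85%" figure comes from. The curve flattens quickly after that, which is why larger samples are mainly justified by quantitative goals and not by discovery.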