Reference Deep-Dive · May 27, 2026 · 9 min read

IDI Sample Size and Saturation: How Many In-Depth Interviews Do You Need?

By Kevin, Founder & CEO

TL;DR

The most common in-depth interview (IDI) planning question is "how many?" The textbook answers (Nielsen's 5-8 for usability discovery, Guest, Bunce and Johnson's 2006 finding of saturation around 12 participants, and Hagaman and Wutich's 2017 threshold of 16-24 for cross-cultural code saturation) were developed under cost regimes where each IDI cost hundreds of dollars and took weeks of moderator calendar. That made the fewest possible interviews the rational objective, so the saturation literature reads as a diminishing-returns frontier. The stopping point on that frontier moves once cost falls. For single-segment exploratory work the textbook numbers still hold; for segment-level analysis, cross-cultural studies, niche populations, and longitudinal designs they understate the real sample by a factor of three to ten. The right question in 2026 is no longer "what is the minimum?" but "what does saturation look like per segment when each interview is no longer the binding constraint?" User Intuition changes that calculus by running AI-moderated IDIs at a unit cost that makes segment-level saturation economical instead of aspirational.

Every IDI project starts with the same planning question: how many interviews do we need? It is the most common methodology question in qualitative research, and the textbook answers — five to eight, twelve, sixteen to twenty-four — get cited in plans, decks, and vendor proposals as if they were universal constants.

They are not. Each of those numbers comes from a specific empirical study run under a specific cost regime, and the cost regime is the part of the citation that almost always gets dropped. Once you put the cost regime back, the textbook numbers stop reading as “the right answer” and start reading as “the answer that was rational when each IDI cost hundreds of dollars and weeks of moderator calendar.”

This guide walks through the actual saturation literature, where the textbook numbers come from, when they still hold, when they break, and how to size IDI samples in a cost regime where the per-interview constraint no longer dominates the design.

What saturation actually means in qualitative research

Saturation is the methodological justification for stopping a qualitative study at N participants. The claim is that interview N+1 would add little or no new information (no new themes, codes, or insights), so the study has already captured the underlying conceptual structure of the segment.

Saturation is segment-specific. Each distinct subgroup in a study has its own path to saturation: loyalists and switchers, enterprise and SMB buyers, new and tenured users, native English speakers and non-native speakers, urban and rural participants. Conflating segments inflates the apparent saturation rate (because diverse participants produce more codes early) and then hides systematic gaps in each subgroup later.

Two flavors get used interchangeably in plans but mean different things:

Thematic saturation — the point where no new major themes emerge. This is what most “12 interviews is enough” claims actually measure.
Code saturation — the point where the full code structure (themes plus sub-themes, exceptions, edge cases) stabilizes. This is a stricter bar and typically requires more interviews than thematic saturation.

The difference matters for IDI planning. A study that needs to surface the major drivers of churn can reach thematic saturation faster than a study that needs to map the full code structure of how those drivers vary across segments.

Where the textbook numbers come from

Three studies anchor the IDI sample-size conversation in qualitative methodology. Each is solid empirical work; each was run under a specific scope condition that often gets dropped when the number gets cited.

Nielsen’s 5-8: a usability finding misapplied to IDIs

Jakob Nielsen’s 2000 article “Why You Only Need to Test with 5 Users,” which builds on the problem-discovery model he published with Thomas Landauer in 1993, analyzed usability evaluation, not in-depth interviewing. The finding: across a sample of usability studies, five participants surfaced approximately 85% of usability problems on a single product flow. The mathematics works because each participant has a fair chance of hitting any given problem (about 31% on average in Nielsen’s data), so small samples catch the high-frequency ones.
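The small-sample logic can be sketched with the problem-discovery model usually associated with Nielsen and Landauer, in which each participant independently surfaces any given problem with the same probability. The function name and the default of 0.31 (the average Nielsen reports) are illustrative choices, not values taken from this guide.

```python
def share_of_problems_found(n_participants: int, p_detect: float = 0.31) -> float:
    """Expected share of problems seen at least once after n participants.

    Assumes every participant independently hits any given problem
    with the same probability p_detect (a simplifying assumption).
    """
    if n_participants < 0:
        raise ValueError("n_participants must be non-negative")
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must be between 0 and 1")
    return 1.0 - (1.0 - p_detect) ** n_participants


if __name__ == "__main__":
    for n in (1, 3, 5, 8):
        print(n, round(share_of_problems_found(n), 2))
```

With a detection probability of 0.31, five participants give about 0.84, which is where the 85% figure comes from. When the per-participant probability is lower, as it plausibly is for reasoning patterns that vary from person to person, the same five participants cover far less: at 0.10 the expected share is about 0.41.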

The 5-8 number became the default for IDI planning by adjacency rather than empirical transfer. Usability discovery (finding friction in a single interface) and depth interviewing (probing reasoning, identity, and emotional drivers) are methodologically different activities with different signal distributions. Reasoning structure is more variable across participants than usability friction, so the small-sample logic that works for the latter does not transfer cleanly to the former.

Where the Nielsen number does hold up for IDIs: exploratory diagnostic work on a single segment, where the goal is to surface major reasoning patterns rather than measure their distribution. For most diagnostic-discovery IDI studies on a single user segment, 5-8 is a reasonable floor.

Guest, Bunce and Johnson’s 12: a single-segment ceiling, not a universal

The most-cited empirical work on IDI saturation is Guest, Bunce and Johnson’s 2006 paper “How Many Interviews Are Enough?” — a methodological analysis of 60 in-depth interviews with women in Ghana and Nigeria. The headline finding: 12 interviews captured roughly 92% of the codes that eventually emerged across all 60. The first six interviews captured about 73%.

The scope conditions that often get dropped when “12” gets cited:

The sample was deliberately homogeneous — women in two specific countries discussing a single topic (reproductive health).
The analysis measured thematic saturation, not code saturation. Sub-themes and edge cases stabilized later than the major themes.
The original paper explicitly cautions that the 12-interview threshold applies to “studies with relatively homogeneous samples and narrowly defined research questions.” Multi-segment studies were out of scope.

Used inside its scope, 12 is a strong number. Used outside its scope — as the default for any IDI study regardless of segmentation — it systematically under-sizes research designs that include cross-group comparison.
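Figures such as 73% and 92% are points on a cumulative code-discovery curve. A minimal sketch of how such a curve can be computed from coded transcripts follows. The function name, the input format (one set of code labels per interview, in interview order), and the toy data are assumptions for illustration, not Guest, Bunce and Johnson’s data or procedure.

```python
from typing import Iterable, List, Set


def cumulative_code_share(codes_per_interview: Iterable[Set[str]]) -> List[float]:
    """Share of the final codebook already seen after each interview."""
    interviews = [set(codes) for codes in codes_per_interview]
    # The final codebook is every code that appears in any interview.
    final_codebook: Set[str] = set().union(*interviews)
    if not final_codebook:
        return [0.0 for _ in interviews]

    seen: Set[str] = set()
    shares: List[float] = []
    for codes in interviews:
        seen |= codes
        shares.append(len(seen) / len(final_codebook))
    return shares


if __name__ == "__main__":
    # Hypothetical toy data: four interviews, five distinct codes overall.
    toy = [{"price", "trust"}, {"price", "support"}, {"trust", "speed"}, {"habit"}]
    print(cumulative_code_share(toy))  # [0.4, 0.6, 0.8, 1.0]
```

The denominator is only known after all interviews are coded, so a curve like this is retrospective: it describes saturation in a finished data set rather than giving a stopping rule in advance. Running it separately per segment, rather than on the pooled sample, is what the segment-specific view of saturation implies.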

Hagaman and Wutich’s 16-24: cross-cultural code saturation

Ashley Hagaman and Amber Wutich’s 2017 paper extended the saturation question to cross-cultural research. Their finding: in cross-cultural studies with two or more cultural cohorts, code saturation typically requires 16-24 interviews per cohort. Below 16, code structure remained unstable; beyond 24, additional interviews produced diminishing structural returns.

This is the closest thing in the literature to a per-cell number for IDIs that involve subgroup comparison, and it generalizes reasonably to non-geographic segmentation: each distinct cultural, demographic, or behavioral cohort in a study can be treated as needing its own 16-24 floor for code saturation. A study that compares three cohorts therefore needs roughly 48-72 IDIs before the cross-cohort code structure stabilizes — far above the headline “12” that anchors most IDI planning conversations.

The cost regime the literature was written in

Every saturation paper cited above was written under the same implicit cost regime: each IDI was expensive to recruit, schedule, conduct, transcribe, and analyze. A skilled moderator could run four to six IDIs per day before fatigue eroded probing quality. Recruiting for niche segments took weeks; recruiting across cultures took months. Transcription was manual. Coding was manual.

In that cost regime, the rational research-design objective was the minimum defensible sample. Saturation papers are useful precisely because they give researchers a methodological justification for stopping early. Every paper in the saturation literature is implicitly answering the same question: “Given that each additional IDI costs hundreds of dollars and weeks of calendar, what is the smallest sample I can defend?”

That framing produces saturation curves drawn as diminishing-returns frontiers: code yield on the y-axis, sample size on the x-axis, and researchers trained to stop where the curve flattens. The frontier is real. What often goes unexamined is that the optimal stopping point on a diminishing-returns frontier is determined by the cost of the next sample, not by the curve alone.
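The dependence of the stopping point on cost can be made concrete with a toy model. The sketch below assumes a fixed pool of codes, each surfaced independently by any interview with the same probability, and a single value placed on each newly discovered code. Every name and parameter value is hypothetical and chosen only to illustrate the shape of the argument.

```python
def expected_new_codes(n_done: int, total_codes: int, p_surface: float) -> float:
    """Expected number of not-yet-seen codes surfaced by interview n_done + 1."""
    return total_codes * p_surface * (1.0 - p_surface) ** n_done


def stopping_point(cost_per_interview: float, value_per_code: float,
                   total_codes: int = 100, p_surface: float = 0.15,
                   max_interviews: int = 500) -> int:
    """Number of interviews run before the next one stops paying for itself."""
    n = 0
    while n < max_interviews:
        # Value of the next interview = value per code x expected new codes.
        marginal_value = value_per_code * expected_new_codes(n, total_codes, p_surface)
        if marginal_value < cost_per_interview:
            break
        n += 1
    return n
```

With these hypothetical inputs, `stopping_point(500, 100)` returns 7 and `stopping_point(50, 100)` returns 21. The discovery curve is identical in both cases; only the cost of the next interview moved the stopping point.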

What the textbook numbers miss

The diminishing-returns frame works inside the cost regime that produced it. Outside that regime, three failure modes show up consistently in research designs that anchored on the textbook minimums:

Segment-level analysis fails. Studies that need to compare loyalists vs. switchers, enterprise vs. SMB, new vs. tenured customers consistently under-sample each cell. A “30-interview cross-segment study” with three segments has 10 interviews per cell — well below even the single-segment thematic-saturation threshold of 12, let alone the cross-cohort code-saturation threshold of 16-24.
Cross-cultural research under-recovers structure. International research designs that apply a 12-interview floor across three or four cultural cohorts surface only the most prevalent themes in each cohort and miss most of the variation in how those themes are expressed. The result is generic findings that lose what makes each market distinctive.
Niche populations under-cover edge cases. Hard-to-reach populations (specific patient cohorts, enterprise IT buyers, regulated-industry professionals) have higher within-segment heterogeneity than convenience samples. The 12-interview rule of thumb that works for general consumer populations breaks down because the code structure stabilizes later.

A fourth failure mode, less visible, is longitudinal: a wave of 12 IDIs run quarterly across a year is 48 interviews total, but the design assumes saturation within each wave. When the wave-level sample is below threshold, the longitudinal signal becomes noisy enough that real shifts between waves are indistinguishable from sampling variation.
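A rough way to see the wave-level noise problem is to treat the share of participants who raise a given theme as a proportion and look at its sampling error. This is a simplification, since qualitative samples are purposive rather than random, and the function name and example figures are assumptions for illustration.

```python
import math


def prevalence_margin(n_interviews: int, prevalence: float, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a theme's prevalence in one wave.

    Uses the normal approximation to the binomial, which is crude at small n;
    it is meant to show the order of magnitude only.
    """
    if n_interviews <= 0:
        raise ValueError("n_interviews must be positive")
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    standard_error = math.sqrt(prevalence * (1.0 - prevalence) / n_interviews)
    return z * standard_error
```

For a theme raised by about half of participants, `prevalence_margin(12, 0.5)` is roughly 0.28, so a wave-over-wave move from 50% to 65% sits well inside the noise. At 48 interviews per wave the margin falls to roughly 0.14.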

When the textbook numbers are still right

The point is not that the textbook numbers are wrong — it is that they are accurate inside their original scope conditions and inaccurate outside. Specifically:

Single homogeneous segment, single research question, exploratory phase. A study validating that a new positioning concept lands with one buyer persona, or surfacing the headline reasons users abandon a single onboarding flow, can run on 5-12 IDIs and reach thematic saturation reliably.
Methodology pilots. Studies where the goal is to refine the interview guide before a larger study can run on 5-8 IDIs without methodological concern.
High-fidelity, low-comparison designs. Studies where the goal is a small number of rich, deeply analyzed individual cases (often labeled “phenomenological” or “narrative” research) intentionally sample small. Saturation is not the binding constraint; depth per case is.

For these designs the saturation literature applies cleanly. The problem arises when teams default to the same numbers for designs that are not in any of these categories.

A decision framework for IDI sample size

Three steps replace the “12 is enough” default:

Count the cells. Segments × dimensions × waves. A study comparing three buyer personas across two regions in a single wave has six cells. A longitudinal study comparing the same six cells across four waves has 24 cells.
Apply per-cell saturation. For single-segment exploratory work, 5-12 per cell. For thematic comparison, 12-20 per cell. For code-saturation work and cross-cultural designs, 16-24 per cell. The numbers come from the saturation literature; the multiplier comes from your design.
Buffer for attrition. Screening attrition, no-shows, and quality-rejected interviews typically mean scheduling 10-20% more interviews than the target number of completes. Plan accordingly.

The output is a sample-size floor that reflects the research design, not a budget anchor. If the resulting number is uncomfortably large under the cost regime you are budgeting against, that is a signal about the cost regime, not the design.
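The three steps can be sketched as a small calculation. The per-cell ranges are the ones given above; the function and parameter names, the dictionary keys, and the 15% default buffer (a midpoint of the 10-20% range) are assumptions of this sketch.

```python
import math
from typing import Dict, Tuple

# Per-cell ranges quoted in the framework above, as (low, high).
PER_CELL_RANGES: Dict[str, Tuple[int, int]] = {
    "exploratory": (5, 12),
    "thematic_comparison": (12, 20),
    "code_saturation": (16, 24),
}


def idi_sample_floor(segments: int, dimensions: int, waves: int,
                     design: str, buffer_pct: int = 15) -> Dict[str, int]:
    """Count cells, apply the per-cell range, and add an attrition buffer."""
    if min(segments, dimensions, waves) < 1:
        raise ValueError("segments, dimensions and waves must each be at least 1")
    low, high = PER_CELL_RANGES[design]
    cells = segments * dimensions * waves
    completes_low = cells * low
    completes_high = cells * high
    return {
        "cells": cells,
        "completes_low": completes_low,
        "completes_high": completes_high,
        # Interviews to schedule, rounded up, after adding the buffer.
        "schedule_low": math.ceil(completes_low * (100 + buffer_pct) / 100),
        "schedule_high": math.ceil(completes_high * (100 + buffer_pct) / 100),
    }
```

For three personas across two regions in one wave, `idi_sample_floor(3, 2, 1, "thematic_comparison")` gives 6 cells, 72 to 120 completed interviews, and 83 to 138 scheduled at a 15% buffer.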

How does User Intuition change the sample-size calculus for IDIs?

Traditional IDI economics — hundreds of dollars per interview, weeks of moderator calendar, single-threaded recruiting, manual transcription, manual coding — forced researchers to default to minimum-defensible-sample. The saturation literature emphasizes minimums because minimums were what teams could afford. Segment-level saturation, cross-cultural code saturation, longitudinal wave-level saturation: all of these are methodologically standard but economically aspirational under traditional fieldwork.

User Intuition runs AI-moderated in-depth interviews in parallel, across unlimited concurrent sessions. The unit cost is roughly an order of magnitude below traditional moderated IDI work, the moderator-calendar bottleneck disappears, and structured laddering five to seven levels deep happens in every conversation. A study with five segments at fifteen IDIs per cell (75 interviews) runs in days rather than months, on the budget that traditional fieldwork would spend on a single twelve-IDI exploratory wave.

What this changes in practice:

Per-segment saturation becomes the default, not a stretch goal. Studies that previously had to pick one or two segments to interview deeply can now cover all the segments the research design calls for.
Cross-cultural code saturation is reachable. Sixteen to twenty-four IDIs per cohort across four or five cohorts is a single sprint, not a six-month international fieldwork program.
Longitudinal sample sizes hold up. Wave-over-wave comparisons can be sized to detect real shifts rather than noise.
Research-design choices replace budget choices. The IDI sample-size conversation moves from “what is the minimum we can justify?” to “what does the question actually need?”

See the user research solutions page for how this shows up across exploratory, segmentation, and longitudinal designs.

Bottom-line guidance

The textbook IDI sample sizes — 5-8, 12, 16-24 — are correct inside the scope conditions of the studies that produced them and incorrect when applied outside those conditions. Single homogeneous segment with a single research question: the textbook numbers hold. Anything segmented, cross-cultural, niche, or longitudinal: multiply per cell, do not divide across cells.

The deeper shift is that the saturation literature was always implicitly answering the question “what is the smallest sample I can defend given how expensive each interview is?” The right question once interviews are no longer the binding constraint is “what does saturation look like per segment when the design, not the budget, sets the sample?”

Start with the cells, apply per-cell saturation, buffer for attrition, and let the research design pick the number.

Note from the User Intuition Team

Human moderation, done well, is the gold standard. A skilled moderator reads silence, follows a half-thought, knows when to push and when to wait. The trouble is what that costs at scale: one moderator, one participant, one hour at a time — and by interview a hundred, even the best aren't asking the same questions they asked at interview one.

User Intuition keeps what makes great moderation great — the depth, the laddering, the patient probing — and removes what holds it back. The AI moderator ladders 5–7 levels deep on every interview, with no fatigue wall and no calendar to manage. It runs hundreds of conversations in parallel, so a study fills in hours instead of weeks. Setup takes five minutes: upload your study guide and we turn it into a plan, write the screener, recruit from our 4M+ panel, and launch. Every interview is automatically scored on Length, Depth, and Coverage; if it doesn't pass, you don't pay. No refund required.

Preview a real study output before you pay — the only platform in the industry that lets you evaluate the work first. A 5-interview study lands at $150 in 24 hours. Already convinced? Sign up and try with 3 free quality interviews.

Frequently Asked Questions

What is saturation in qualitative research, and why does it determine IDI sample size?

Saturation is the point in a qualitative study where additional interviews stop revealing new themes, codes, or insights. It's the methodological justification for stopping at N participants: the claim that interview N+1 would add little or no new information. Guest, Bunce and Johnson's 2006 study is the most-cited empirical work on the threshold: in a homogeneous sample of 60 women in two African countries, 12 interviews captured roughly 92% of the codes that eventually emerged, and the first six interviews captured 73%. Saturation is segment-specific, though: each distinct subgroup in a study needs its own path to saturation, which is why cross-segment designs need far more interviews than the headline 12 number suggests.

What did Nielsen actually say about 5-8 participants?

Jakob Nielsen's 2000 article 'Why You Only Need to Test with 5 Users', which builds on the problem-discovery model he published with Thomas Landauer in 1993, analyzed usability evaluation, not in-depth interviews. The finding: 5 users surface roughly 85% of usability problems on a single product flow, because usability errors are common enough that small samples catch most of them. The 5-8 number became a default for IDI planning by adjacency, even though usability discovery (looking for friction in a single interface) is methodologically different from depth interviewing (probing reasoning, identity, and emotional drivers). For exploratory diagnostic IDIs on a single segment the number transfers reasonably well; for anything segmented or comparative it doesn't.

When are the textbook IDI sample sizes (5-12) actually right?

Three conditions: a single homogeneous segment, a single well-defined research question, and a study where the team is comfortable trading statistical confidence for speed. Examples: validating that a new product positioning is intelligible to one buyer persona; exploring early reactions to a single design concept before broader testing; methodology pilots where the goal is to refine the interview guide rather than produce reportable findings. Outside those conditions (multiple segments, cross-cultural comparisons, niche populations, longitudinal designs, regulated industries that demand audit trails) the textbook minimums systematically understate what the research design actually needs.

How do I decide IDI sample size for a real study?

Start with the research design, not the budget. Count the distinct cells you need to compare: segments × dimensions × waves. For each cell, plan 12-20 IDIs to reach saturation within that subgroup. Add a 10-20% buffer for screening attrition, no-shows, and quality-rejected interviews. The result is a floor, not a ceiling: if early interviews reveal more heterogeneity within a segment than expected, the segment splits and the sample grows. For exploratory single-segment work, 5-12 is usually enough. For segment-level analysis across 3-5 cells, 60-100 is the floor. For cross-cultural research with code saturation, plan 16-24 per cultural cohort.

How does User Intuition change the sample-size calculus for IDIs?

Traditional IDI economics (hundreds of dollars per interview, weeks of moderator calendar, single-threaded recruiting) forced researchers to pick the smallest sample they could methodologically defend, which is why the saturation literature emphasizes minimums. User Intuition flips that constraint by running AI-moderated IDIs in parallel at a unit cost that makes segment-level saturation routine rather than aspirational. A study with five segments at 15 IDIs per cell (75 interviews) runs in 24 hours instead of three months, on the same budget that traditional fieldwork would spend on a single 12-IDI exploratory wave. See [in-depth interviews](/platform/in-depth-interviews/) for the platform detail.
