Every in-depth interview (IDI) project starts with the same planning question: how many interviews do we need? It is one of the most common methodology questions in qualitative research, and the textbook answers (five to eight, twelve, sixteen to twenty-four) get cited in plans, decks, and vendor proposals as if they were universal constants.
They are not. Each of those numbers comes from a specific empirical study run under a specific cost regime, and the cost regime is the part of the citation that almost always gets dropped. Once you put the cost regime back, the textbook numbers stop reading as “the right answer” and start reading as “the answer that was rational when each IDI cost hundreds of dollars and weeks of moderator calendar.”
This guide walks through the actual saturation literature, where the textbook numbers come from, when they still hold, when they break, and how to size IDI samples in a cost regime where the per-interview constraint no longer dominates the design.
What saturation actually means in qualitative research
Saturation is the methodological justification for stopping a qualitative study at N participants. The claim is that interview N+1 would add little or no new information (no new themes, codes, or insights), so the study has captured the underlying conceptual structure of the segment.
Saturation is segment-specific. Each distinct subgroup in a study has its own path to saturation: loyalists and switchers, enterprise and SMB buyers, new and tenured users, native English speakers and non-native speakers, urban and rural participants. Conflating segments inflates the apparent saturation rate (because diverse participants produce more codes early) and then hides systematic gaps in each subgroup later.
Two flavors get used interchangeably in plans but mean different things:
- Thematic saturation — the point where no new major themes emerge. This is what most “12 interviews is enough” claims actually measure.
- Code saturation — the point where the full code structure (themes plus sub-themes, exceptions, edge cases) stabilizes. This is a stricter bar and typically requires more interviews than thematic saturation.
The difference matters for IDI planning. A study that needs to surface the major drivers of churn can reach thematic saturation faster than a study that needs to map the full code structure of how those drivers vary across segments.
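As an illustration of the two bars, the sketch below computes a retrospective saturation curve from coded interviews. The codes, the `MAJOR_THEMES` set, and both helper functions are hypothetical and not drawn from any study cited here. Coverage is measured against the final code set, so it can only be computed after all interviews are coded.

```python
from typing import Iterable, Optional

def cumulative_coverage(interviews: list[set[str]],
                        universe: Optional[Iterable[str]] = None) -> list[float]:
    """Share of the target code set seen after each interview, in field order."""
    # Default target: every code that appears anywhere in the study.
    target = set(universe) if universe is not None else set().union(*interviews)
    if not target:
        raise ValueError("no codes to measure coverage against")
    seen: set[str] = set()
    coverage = []
    for codes in interviews:
        seen |= codes & target
        coverage.append(len(seen) / len(target))
    return coverage

def first_saturated(coverage: list[float], threshold: float = 1.0) -> Optional[int]:
    """1-based index of the first interview where coverage reaches the threshold."""
    for i, share in enumerate(coverage, start=1):
        if share >= threshold:
            return i
    return None

# Hypothetical coded interviews: major themes plus "theme:sub-code" entries.
INTERVIEWS = [
    {"price", "support"},
    {"price", "onboarding"},
    {"support", "price:annual-lock-in"},
    {"onboarding", "support:slow-escalation"},
    {"price", "onboarding:sso-setup"},
]
MAJOR_THEMES = {"price", "support", "onboarding"}

thematic = first_saturated(cumulative_coverage(INTERVIEWS, MAJOR_THEMES))
full_code = first_saturated(cumulative_coverage(INTERVIEWS))
print(thematic, full_code)  # 2 5
```

In this toy data the major themes are all present by the second interview, while the full code structure keeps growing until the fifth. That gap is the difference between the two bars.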
Where the textbook numbers come from
Three studies anchor the IDI sample-size conversation in qualitative methodology. Each is solid empirical work; each was run under a specific scope condition that often gets dropped when the number gets cited.
Nielsen’s 5-8: a usability finding misapplied to IDIs
Jakob Nielsen’s 2000 article “Why You Only Need to Test with 5 Users,” which builds on the problem-discovery model he published with Thomas Landauer in 1993, analyzed usability evaluation, not in-depth interviewing. The finding: across a sample of usability studies, five participants surfaced approximately 85% of major usability problems on a single product flow. The mathematics works because usability errors tend to be common enough that small samples catch the high-frequency ones.
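The curve behind this finding is usually written as 1 - (1 - L)^n, where L is the share of problems a single participant reveals. Nielsen and Landauer reported an average L of about 0.31. The sketch below evaluates that curve. The function name is hypothetical, and the second call uses an arbitrary lower rate of 0.10 purely to show how sensitive the result is to L. It is not a measured property of interviews.

```python
def share_found(n_participants: int, detect_rate: float = 0.31) -> float:
    """Expected share of problems found by n participants under 1 - (1 - L)^n."""
    return 1.0 - (1.0 - detect_rate) ** n_participants

# With L = 0.31, five participants find roughly 84-85% of problems.
print(round(share_found(5), 3))        # 0.844

# Illustrative assumption: if each participant reveals only 10% of the
# patterns, the same five participants cover far less.
print(round(share_found(5, 0.10), 3))  # 0.41
```

The five-participant result depends entirely on the per-participant detection rate, which is why it is risky to carry the number over to an activity with a different signal distribution.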
The 5-8 number became the default for IDI planning by adjacency rather than empirical transfer. Usability discovery (finding friction in a single interface) and depth interviewing (probing reasoning, identity, and emotional drivers) are methodologically different activities with different signal distributions. Reasoning structure is more variable across participants than usability friction, so the small-sample logic that works for the latter does not transfer cleanly to the former.
Where the Nielsen number does hold up for IDIs: exploratory diagnostic work on a single segment, where the goal is to surface major reasoning patterns rather than measure their distribution. For most diagnostic-discovery IDI studies on a single user segment, 5-8 is a reasonable floor.
Guest, Bunce and Johnson’s 12: a single-segment ceiling, not a universal constant
The most-cited empirical work on IDI saturation is Guest, Bunce and Johnson’s 2006 paper “How Many Interviews Are Enough?” — a methodological analysis of 60 in-depth interviews with women in Ghana and Nigeria. The headline finding: 12 interviews captured roughly 92% of the codes that eventually emerged across all 60. The first six interviews captured about 73%.
The scope conditions that often get dropped when “12” gets cited:
- The sample was deliberately homogeneous — women in two specific countries discussing a single topic (reproductive health).
- The analysis measured thematic saturation, not code saturation. Sub-themes and edge cases stabilized later than the major themes.
- The original paper explicitly cautions that the 12-interview threshold applies to “studies with relatively homogeneous samples and narrowly defined research questions.” Multi-segment studies were out of scope.
Used inside its scope, 12 is a strong number. Used outside its scope — as the default for any IDI study regardless of segmentation — it systematically under-sizes research designs that include cross-group comparison.
Hagaman and Wutich’s 16-24: cross-cultural code saturation
Ashley Hagaman and Amber Wutich’s 2017 paper extended the saturation question to cross-cultural research. Their finding: in cross-cultural studies with two or more cultural cohorts, code saturation typically requires 16-24 interviews per cohort. Below 16, code structure remained unstable; beyond 24, additional interviews produced diminishing structural returns.
This is the closest thing in the literature to a per-cell number for IDIs that involve subgroup comparison, and it generalizes reasonably to non-geographic segmentation: each distinct cultural, demographic, or behavioral cohort in a study can be treated as needing its own 16-24 floor for code saturation. A study that compares three cohorts therefore needs roughly 48-72 IDIs before the cross-cohort code structure stabilizes — far above the headline “12” that anchors most IDI planning conversations.
The cost regime the literature was written in
Every saturation paper cited above was written under the same implicit cost regime: each IDI was expensive to recruit, schedule, conduct, transcribe, and analyze. A skilled moderator could run four to six IDIs per day before fatigue eroded probing quality. Recruiting for niche segments took weeks; recruiting across cultures took months. Transcription was manual. Coding was manual.
In that cost regime, the rational research-design objective was minimum-defensible-sample. Saturation papers are useful precisely because they give researchers a methodological justification for stopping early. Every paper in the saturation literature is implicitly answering the same question: “Given that each additional IDI costs hundreds of dollars and weeks of calendar, what is the smallest sample I can defend?”
That framing produces saturation curves drawn as diminishing-returns frontiers: code yield on the y-axis, sample size on the x-axis, and researchers trained to stop at the point where the curve flattens. The frontier is real. What often goes unexamined is that the optimal stopping point on a diminishing-returns frontier is determined by the cost of the next sample, not by the curve alone.
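A minimal sketch of that point, under loudly stated assumptions: code yield follows a simple 1 - (1 - p)^n curve, full coverage is worth a fixed dollar amount, and the study stops when the next interview's expected yield is worth less than it costs. The detection rate, dollar values, and function names are all hypothetical.

```python
def marginal_yield(n_done: int, detect_rate: float) -> float:
    """Extra share of codes expected from interview n_done + 1.

    Under coverage(n) = 1 - (1 - p)^n, the increment is p * (1 - p)^n.
    """
    return detect_rate * (1.0 - detect_rate) ** n_done

def optimal_n(cost_per_interview: float, value_of_full_coverage: float,
              detect_rate: float = 0.15, max_n: int = 500) -> int:
    """Number of interviews worth running before marginal value drops below cost."""
    n = 0
    while (n < max_n and
           marginal_yield(n, detect_rate) * value_of_full_coverage > cost_per_interview):
        n += 1
    return n

# Same curve, same value of coverage (hypothetical $50,000), different unit cost.
print(optimal_n(cost_per_interview=500, value_of_full_coverage=50_000))  # 17
print(optimal_n(cost_per_interview=50, value_of_full_coverage=50_000))   # 31
```

The curve is identical in both calls. Only the cost of the next interview changes, and the stopping point moves from 17 to 31.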
What the textbook numbers miss
The diminishing-returns frame works inside the cost regime that produced it. Outside that regime, three failure modes show up consistently in research designs that anchored on the textbook minimums:
- Segment-level analysis fails. Studies that need to compare loyalists vs. switchers, enterprise vs. SMB, new vs. tenured customers consistently under-sample each cell. A “30-interview cross-segment study” with three segments has 10 interviews per cell, well below even the single-segment thematic-saturation threshold of 12, let alone the cross-cohort code-saturation threshold of 16-24.
- Cross-cultural research under-recovers structure. International research designs that apply a 12-interview floor across three or four cultural cohorts surface only the most prevalent themes in each cohort and miss most of the variation in how those themes are expressed. The result is generic findings that lose what makes each market distinctive.
- Niche populations under-cover edge cases. Hard-to-reach populations (specific patient cohorts, enterprise IT buyers, regulated-industry professionals) have higher within-segment heterogeneity than convenience samples. The 12-interview rule of thumb that works for general consumer populations breaks down because the code structure stabilizes later.
A fourth failure mode, less visible, is longitudinal: a wave of 12 IDIs run quarterly across a year is 48 interviews total, but the design assumes saturation within each wave. When the wave-level sample is below threshold, the longitudinal signal becomes noisy enough that real shifts between waves are indistinguishable from sampling variation.
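To give a rough sense of scale for that noise, the sketch below estimates how far a theme's observed prevalence can move between two waves from sampling variation alone. It assumes independent participants in each wave and uses a normal approximation, which is crude at these sample sizes. Qualitative studies are not designed for prevalence estimation, so treat this as an order-of-magnitude illustration only. The function name and the 50% prevalence default are hypothetical.

```python
import math

def wave_shift_margin(n_per_wave: int, prevalence: float = 0.5,
                      z: float = 1.96) -> float:
    """Approximate 95% margin on the wave-to-wave difference in a theme's prevalence."""
    se_one_wave = math.sqrt(prevalence * (1.0 - prevalence) / n_per_wave)
    # Two independent waves: variances add, so the SE scales by sqrt(2).
    return z * math.sqrt(2) * se_one_wave

for n in (12, 24, 48):
    print(n, round(wave_shift_margin(n), 2))
# 12 0.4
# 24 0.28
# 48 0.2
```

Under these assumptions, a theme mentioned by half of participants could appear to swing by about 40 percentage points between two 12-interview waves without any real change in the population.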
When the textbook numbers are still right
The point is not that the textbook numbers are wrong — it is that they are accurate inside their original scope conditions and inaccurate outside. Specifically:
- Single homogeneous segment, single research question, exploratory phase. A study validating that a new positioning concept lands with one buyer persona, or surfacing the headline reasons users abandon a single onboarding flow, can run on 5-12 IDIs and reach thematic saturation reliably.
- Methodology pilots. Studies where the goal is to refine the interview guide before a larger study can run on 5-8 IDIs without methodological concern.
- High-fidelity, low-comparison designs. Studies where the goal is a small number of rich, deeply analyzed individual cases (often labeled “phenomenological” or “narrative” research) intentionally sample small. Saturation is not the binding constraint; depth per case is.
For these designs the saturation literature applies cleanly. The problem arises when teams default to the same numbers for designs that are not in any of these categories.
A decision framework for IDI sample size
Three steps replace the “12 is enough” default:
- Count the cells. Segments × dimensions × waves. A study comparing three buyer personas across two regions in a single wave has six cells. A longitudinal study comparing the same six cells across four waves has 24 cells.
- Apply per-cell saturation. For single-segment exploratory work, 5-12 per cell. For thematic comparison, 12-20 per cell. For code-saturation work and cross-cultural designs, 16-24 per cell. The numbers come from the saturation literature; the multiplier comes from your design.
- Buffer for attrition. Screening attrition, no-shows, and quality-rejected interviews typically mean recruiting 10-20% above the target number of completed interviews. Plan accordingly.
The output is a sample-size floor that reflects the research design, not a budget anchor. If the resulting number is uncomfortably large under the cost regime you are budgeting against, that is a signal about the cost regime, not the design.
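The three steps can be sketched as a small calculator. The per-cell ranges are the ones given in the framework. The function and field names are hypothetical, and the 15% default buffer is an assumed midpoint of the 10-20% range, read here as extra recruits on top of target completes.

```python
import math
from typing import NamedTuple

# Per-cell ranges (low, high) as stated in the framework.
PER_CELL = {
    "exploratory": (5, 12),
    "thematic_comparison": (12, 20),
    "code_saturation": (16, 24),
}

class Plan(NamedTuple):
    cells: int
    completes: tuple[int, int]  # (low, high) completed interviews
    recruits: tuple[int, int]   # (low, high) after the attrition buffer

def size_study(segments: int, dimensions: int, waves: int,
               design: str, buffer_pct: int = 15) -> Plan:
    """Step 1: count cells. Step 2: multiply per cell. Step 3: buffer."""
    cells = segments * dimensions * waves
    low, high = PER_CELL[design]
    completes = (cells * low, cells * high)
    # Integer percent keeps the rounding free of float surprises.
    recruits = tuple(math.ceil(n * (100 + buffer_pct) / 100) for n in completes)
    return Plan(cells, completes, recruits)

# Three buyer personas across two regions, single wave, code-saturation work.
print(size_study(3, 2, 1, "code_saturation"))
# Plan(cells=6, completes=(96, 144), recruits=(111, 166))
```

Running the same design across four waves gives 24 cells, and the floor scales with it. The calculator only multiplies; choosing which per-cell range applies remains a judgment about the research question.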
How does User Intuition change the sample-size calculus for IDIs?
Traditional IDI economics — hundreds of dollars per interview, weeks of moderator calendar, single-threaded recruiting, manual transcription, manual coding — forced researchers to default to minimum-defensible-sample. The saturation literature emphasizes minimums because minimums were what teams could afford. Segment-level saturation, cross-cultural code saturation, longitudinal wave-level saturation: all of these are methodologically standard but economically aspirational under traditional fieldwork.
User Intuition runs AI-moderated in-depth interviews in parallel, across unlimited concurrent sessions. The unit cost is roughly an order of magnitude below traditional moderated IDI work, the moderator-calendar bottleneck disappears, and structured laddering five to seven levels deep happens in every conversation. A study with five segments at fifteen IDIs per cell (75 interviews) runs in days rather than months, on the budget that traditional fieldwork would spend on a single twelve-IDI exploratory wave.
What this changes in practice:
- Per-segment saturation becomes the default, not a stretch goal. Studies that previously had to pick one or two segments to interview deeply can now cover all the segments the research design calls for.
- Cross-cultural code saturation is reachable. Sixteen to twenty-four IDIs per cohort across four or five cohorts is a single sprint, not a six-month international fieldwork program.
- Longitudinal sample sizes hold up. Wave-over-wave comparisons can be sized to detect real shifts rather than noise.
- Research-design choices replace budget choices. The IDI sample-size conversation moves from “what is the minimum we can justify?” to “what does the question actually need?”
See the user research solutions page for how this shows up across exploratory, segmentation, and longitudinal designs.
Bottom-line guidance
The textbook IDI sample sizes — 5-8, 12, 16-24 — are correct inside the scope conditions of the studies that produced them and incorrect when applied outside those conditions. Single homogeneous segment with a single research question: the textbook numbers hold. Anything segmented, cross-cultural, niche, or longitudinal: multiply per cell, do not divide across cells.
The deeper shift is that the saturation literature was always implicitly answering the question “what is the smallest sample I can defend given how expensive each interview is?” The right question once interviews are no longer the binding constraint is “what does saturation look like per segment when the design, not the budget, sets the sample?”
Start with the cells, apply per-cell saturation, buffer for attrition, and let the research design pick the number.