
IDI Coding and Scoring: The Qual-to-Quant Bridge

By Kevin, Founder & CEO

A 20-interview in-depth-interview (IDI) study produces somewhere between 200,000 and 240,000 words of raw transcript. Twenty hours of conversation. Hundreds of pages. By the time a research lead has read every transcript once, taken margin notes, drafted a codebook, double-coded a reliability sample, synthesized themes, and packaged findings into a deck stakeholders can act on, two to three weeks of senior analyst time have evaporated. For many product, strategy, and CX decisions, the window that prompted the research has already closed by then.

This guide is about the methodology that bridges rich qualitative narrative to structured findings — the step that turns transcripts into evidence. It covers what coding is, when to let codes emerge versus apply a predefined codebook, how to design a codebook that survives intercoder reliability checks, when thematic codes warrant frequency counts, and how AI-assisted coding changes the economics of the whole process. The companion guide on the full six-step IDI analysis process takes a wider view; this one zooms in on the qual-to-quant bridge specifically.

What qualitative coding actually is

Qualitative coding is the systematic act of labeling segments of transcript text with short tags — codes — that capture what each segment is about, what it means, or what it does in the participant’s account. A single sentence can carry multiple codes. A 60-minute interview transcript typically produces 80-150 coded segments. A 20-interview study consolidates into 150-300 distinct codes after deduplication.

The point of coding is not categorization for its own sake. It is the construction of an evidence trail. Every finding in the final report should be traceable backward: claim to theme to code to specific transcript excerpts. Without that chain, qualitative findings are opinions defended by intuition, and the first stakeholder who asks “where did this come from?” exposes the gap.
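The claim-to-theme-to-code-to-excerpt chain can be sketched as a small data model. This is a minimal illustration, not a prescribed schema: the class names, the `evidence` and `untraceable_codes` helpers, the participant IDs, and the sample verbatims are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Excerpt:
    transcript_id: str  # hypothetical participant/transcript ID
    text: str           # verbatim segment


@dataclass
class Code:
    name: str
    excerpts: List[Excerpt] = field(default_factory=list)


@dataclass
class Theme:
    name: str
    codes: List[Code] = field(default_factory=list)


@dataclass
class Finding:
    claim: str
    themes: List[Theme] = field(default_factory=list)

    def evidence(self) -> List[Tuple[str, str, str, str]]:
        """Walk claim -> theme -> code -> excerpt; one row per excerpt."""
        return [
            (theme.name, code.name, ex.transcript_id, ex.text)
            for theme in self.themes
            for code in theme.codes
            for ex in code.excerpts
        ]

    def untraceable_codes(self) -> List[str]:
        """Codes with no supporting excerpt, i.e. gaps in the evidence trail."""
        return [c.name for t in self.themes for c in t.codes if not c.excerpts]


# Hypothetical example data for illustration only.
finding = Finding(
    claim="Participants read support friction as a sign the company does not value them",
    themes=[Theme("institutional indifference", [
        Code("spent 30 minutes on hold", [Excerpt("P03", "I spent 30 minutes on hold.")]),
        Code("transferred three times", [Excerpt("P11", "They transferred me three times.")]),
        Code("no callback option"),  # no excerpt attached yet
    ])],
)
print(len(finding.evidence()))      # 2
print(finding.untraceable_codes())  # ['no callback option']
```

Any code that shows up in `untraceable_codes()` is exactly the gap a stakeholder's "where did this come from?" would expose.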

Two coding passes are standard:

  • Open coding assigns descriptive codes to each meaningful segment without imposing a prior framework. The goal is granularity — over-coding is fixable, under-coding is not. Use the participant’s own language where possible (“spent 30 minutes on hold” rather than “excessive wait time”) because in-vivo codes preserve framing that interpretive codes flatten.
  • Axial coding groups open codes into categories and identifies the relationships between them. “Spent 30 minutes on hold,” “transferred three times,” and “no callback option” might consolidate under “support accessibility barriers.” Axial codes are the scaffold themes will eventually rest on.
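The relationship between the two passes can be sketched as a mapping from open codes to axial categories. The function name `group_axial` and the fourth, ungrouped code are hypothetical; the first three codes and their category come from the example above.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Open code -> axial category (None means not yet grouped).
open_to_axial: Dict[str, Optional[str]] = {
    "spent 30 minutes on hold": "support accessibility barriers",
    "transferred three times": "support accessibility barriers",
    "no callback option": "support accessibility barriers",
    "surprised by renewal price": None,  # hypothetical open code awaiting a category
}


def group_axial(mapping: Dict[str, Optional[str]]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Invert the mapping into category -> open codes, and list ungrouped codes."""
    categories: Dict[str, List[str]] = defaultdict(list)
    ungrouped: List[str] = []
    for open_code, category in mapping.items():
        if category is None:
            ungrouped.append(open_code)
        else:
            categories[category].append(open_code)
    return dict(categories), ungrouped


categories, ungrouped = group_axial(open_to_axial)
print(len(categories["support accessibility barriers"]))  # 3
print(ungrouped)  # ['surprised by renewal price']
```

Keeping the open codes intact under each category preserves the in-vivo wording instead of overwriting it with the interpretive label.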

A common failure mode at this stage is conflating coding with theming. Codes are descriptive. They name what is in the data. Themes are interpretive. They name what the data means. Compressing these two steps — jumping from “I see eight people mentioned switching costs” straight to “switching costs are a barrier” — is how teams produce findings that fall apart under questioning. The thematic claim has not been earned by the coding work.

Inductive vs deductive coding: which one fits the study?

Two approaches sit at opposite ends of a spectrum, and most rigorous studies live somewhere in between.

Inductive coding lets codes emerge from the data. The analyst reads transcripts without a pre-existing framework and assigns codes based on what participants actually say. This is the right approach when the research question is exploratory — you are entering a problem space you do not fully understand, and you do not want to impose categories participants would not use themselves. Most foundational discovery work, customer-journey research, and ethnographic studies use inductive coding by default.

Deductive coding applies a predefined codebook derived from prior research, theoretical frameworks, or strategic priorities. The analyst knows the categories before opening the transcript and assigns segments to them. This is the right approach for longitudinal studies where year-over-year comparability matters, for regulated research (clinical, financial-services compliance) where a defined framework is required, and for studies that explicitly test whether a hypothesized pattern shows up.

Hybrid coding — the practical default for applied research — runs open coding on the first 5-8 transcripts to surface participants’ own framing, stabilizes a codebook from those emergent codes, then applies the codebook deductively to the remainder. The first wave protects against framework imposition; the stabilization protects against scope drift. If new codes are still emerging after interview 12-15, the research question is broader than the sample can support and you should either tighten the question or expand the sample.
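One way to watch for late-emerging codes is to count how many codes appear for the first time in each interview. The sketch below assumes each interview has already been reduced to a set of code names; the function names and the four-interview toy data are hypothetical, and the cut-off is a parameter rather than a fixed rule.

```python
from typing import Iterable, List, Set


def new_codes_per_interview(coded_interviews: Iterable[Set[str]]) -> List[int]:
    """Count codes that appear for the first time in each interview, in order."""
    seen: Set[str] = set()
    counts: List[int] = []
    for codes in coded_interviews:
        fresh = set(codes) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def still_emerging(counts: List[int], after_interview: int) -> bool:
    """True if any new code first appears later than interview `after_interview` (1-indexed)."""
    return any(n > 0 for n in counts[after_interview:])


# Hypothetical toy study of four interviews (a real study would have far more).
interviews = [
    {"spent 30 minutes on hold", "transferred three times"},
    {"transferred three times", "no callback option"},
    {"no callback option"},
    {"no callback option", "switching cost"},
]
counts = new_codes_per_interview(interviews)
print(counts)                                     # [2, 1, 0, 1]
print(still_emerging(counts, after_interview=3))  # True: interview 4 added a code
```

For a 20-interview study the same check would be run with `after_interview=12` or `15`, matching the range given above.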

Codebook design: structure that survives review

A codebook is a living document that defines every code in the study, provides inclusion and exclusion criteria, and lists representative excerpts. It is the artifact that makes coding auditable, repeatable, and defensible.

A well-designed codebook entry contains five elements:

  1. Code name — short, distinctive, ideally three words or fewer. Avoid jargon the participants did not use.
  2. Definition — one sentence that names what the code captures.
  3. Inclusion criteria — what counts as this code. Be specific about scope.
  4. Exclusion criteria — what looks like this code but is actually something else. This is the field that prevents coder drift.
  5. Example excerpts — two or three verbatim segments that clearly belong to this code. Pull them from the actual transcripts, not invented examples.
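The five elements map naturally onto a small record type. The sketch below is one possible shape, with a few mechanical checks drawn from the list above; the field names, the `problems` helper, and the sample definition and criteria text are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CodebookEntry:
    name: str
    definition: str
    inclusion: str
    exclusion: str
    examples: List[str]

    def problems(self) -> List[str]:
        """Return mechanical issues; judgment calls still need a human reviewer."""
        issues: List[str] = []
        if len(self.name.split()) > 3:
            issues.append("name longer than three words")
        if not self.exclusion.strip():
            issues.append("missing exclusion criteria")
        if not 2 <= len(self.examples) <= 3:
            issues.append("expected two or three example excerpts")
        return issues


# Hypothetical entry; definition and criteria are illustrative wording only.
entry = CodebookEntry(
    name="support accessibility barriers",
    definition="Participant describes difficulty reaching or getting help from support.",
    inclusion="Hold times, transfers, missing callback options.",
    exclusion="Complaints about the quality of the answer once support was reached.",
    examples=["spent 30 minutes on hold", "transferred three times"],
)
print(entry.problems())  # []

draft = CodebookEntry("very long descriptive code name", "d", "i", "", ["only one"])
print(len(draft.problems()))  # 3
```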

Codebook size scales with research scope. A focused single-product evaluation might consolidate to 30-50 codebook entries. A broader category exploration runs 80-150. Above 150 entries you are usually over-coding: different surface phrasings of the same underlying pattern that should have been merged.

The codebook stabilizes after coding the first 5-8 transcripts. From that point forward, new codes should be rare and should trigger a deliberate decision: extend the codebook, or absorb under an existing code. Every extension and merge should be timestamped and rationale-tagged. This audit trail is what makes the codebook hold up when a stakeholder asks why a particular finding was scoped the way it was.
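A timestamped, rationale-tagged change log can be as simple as an append-only list. This is a minimal sketch assuming only two kinds of change, "extend" and "merge"; the `CodebookChange` and `log_change` names and the sample rationales are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class CodebookChange:
    timestamp: datetime
    action: str            # "extend" (new code) or "merge" (absorb into an existing code)
    code: str
    target: Optional[str]  # existing code that absorbs `code`; None for "extend"
    rationale: str


def log_change(log: List[CodebookChange], action: str, code: str, rationale: str,
               target: Optional[str] = None,
               now: Optional[datetime] = None) -> CodebookChange:
    """Validate and append one change; refuses entries without a rationale."""
    if action not in ("extend", "merge"):
        raise ValueError("action must be 'extend' or 'merge'")
    if action == "merge" and not target:
        raise ValueError("a merge needs the code it is absorbed into")
    if not rationale.strip():
        raise ValueError("every change needs a rationale")
    change = CodebookChange(now or datetime.now(timezone.utc), action, code, target, rationale)
    log.append(change)
    return change


log: List[CodebookChange] = []
log_change(log, "extend", "no callback option",
           "Raised unprompted; not covered by an existing code")
log_change(log, "merge", "excessive wait time",
           "Same pattern as the in-vivo code", target="spent 30 minutes on hold")
print([c.action for c in log])  # ['extend', 'merge']
```

Making the record immutable (`frozen=True`) and the rationale mandatory keeps the log usable as an audit trail rather than a scratchpad.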

Intercoder reliability: kappa, alpha, and when each fits

Intercoder reliability (ICR) is the formal check that two or more analysts coding the same material produce consistent codes. It is the answer to the question “is this finding a property of the data or a property of the coder?”

ICR matters whenever the stakes of the findings are high enough to invite challenge — board-level strategic decisions, regulatory submissions, M&A diligence, published research, vendor selection. For early-stage exploratory work, ICR is often skipped, which is acceptable as long as the report explicitly labels the work as exploratory.

Two statistics dominate the literature:

Cohen’s kappa is the standard for two coders applying one categorical dimension. It corrects for chance agreement: two coders flipping coins between two labels would agree about half the time, and kappa adjusts the raw agreement rate downward to account for that baseline. Values of 0.80 and above indicate strong agreement; 0.60-0.79 is moderate; below 0.60 means the codebook needs work or the coders need calibration training. Cohen’s kappa is the right metric when you have exactly two coders and a single coding dimension.
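The chance correction is easier to see worked through. Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each coder's label frequencies. The sketch below is a plain-Python implementation of that formula; the ten coded segments and the two labels are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(coder_a: Sequence[str], coder_b: Sequence[str]) -> float:
    """Cohen's kappa for two coders assigning one label per segment."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of segments")
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: probability both coders pick the same label independently.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both coders use a single identical label")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten segments.
a = ["cost", "cost", "trust", "trust", "cost", "trust", "cost", "cost", "trust", "cost"]
b = ["cost", "cost", "trust", "cost", "cost", "trust", "cost", "trust", "trust", "cost"]
kappa = cohens_kappa(a, b)
print(round(kappa, 3))  # 0.583
```

Here the coders agree on 8 of 10 segments, yet chance alone would produce 52% agreement, so kappa lands at about 0.58: a raw agreement rate that looks comfortable can still fall below the 0.60 line.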

Krippendorff’s alpha generalizes to multiple coders, multiple coding dimensions, ordinal or interval data, and missing data. It is the more flexible statistic and the default for large-team studies, longitudinal programs, and any setting where the simpler kappa assumptions do not hold. Interpretation thresholds are similar (0.80+ strong, 0.667-0.79 acceptable, below 0.667 inadequate).
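For nominal codes, alpha can be computed from a coincidence matrix: every pair of labels given to the same segment by different coders is counted, weighted by 1/(m - 1) where m is the number of coders who labeled that segment. The sketch below covers only the nominal case with missing values and is not a substitute for a vetted statistics package; the function name and the three-coder data set are illustrative.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, List, Optional, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Optional[Hashable]]]) -> float:
    """Nominal Krippendorff's alpha. Each unit holds one label per coder; None = missing."""
    coincidences: Counter = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a segment labeled by fewer than two coders cannot be paired
        for x, y in permutations(values, 2):  # ordered pairs across different coders
            coincidences[(x, y)] += 1.0 / (m - 1)
    totals: Counter = Counter()
    for (x, _y), weight in coincidences.items():
        totals[x] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable labels")
    observed = sum(w for (x, y), w in coincidences.items() if x != y)
    expected = sum(totals[x] * totals[y] for x in totals for y in totals if x != y) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Illustrative data: 15 segments, three coders, labels 1-4, None where a coder skipped.
units: List[List[Optional[int]]] = [
    [None, 1, None], [None, None, None], [None, 2, 2], [None, 1, 1], [None, 3, 3],
    [3, 3, 4], [4, 4, 4], [1, 3, None], [2, None, 2], [1, None, 1],
    [1, None, 1], [3, None, 3], [3, None, 3], [None, None, None], [3, None, 4],
]
alpha = krippendorff_alpha_nominal(units)
print(round(alpha, 3))  # 0.691
```

On these data alpha comes out at about 0.69, which sits in the 0.667-0.79 band described above.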

The procedure for both: have two coders independently code a 20% sample of transcripts using the stabilized codebook. Compute the statistic. If it clears threshold, proceed with a single coder for the remainder. If it does not, the codebook gets sharper definitions, the coders get calibration training on the divergent codes, and you recode the reliability sample. This step adds 4-6 hours per coder for a 20-interview study and is the single highest-leverage investment in defensibility you can make.
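Drawing the reliability sample reproducibly is worth a few lines of code, since the draw itself becomes part of the audit trail. The sketch below assumes simple random sampling with a fixed seed; the function name, the seed value, and the transcript IDs are hypothetical, and a stratified draw may suit segmented studies better.

```python
import random
from typing import List, Sequence


def reliability_sample(transcript_ids: Sequence[str], fraction: float = 0.20,
                       seed: int = 0) -> List[str]:
    """Pick a reproducible random subset of transcripts for double coding."""
    k = max(1, round(len(transcript_ids) * fraction))
    rng = random.Random(seed)  # fixed seed so the same draw can be re-created later
    return sorted(rng.sample(list(transcript_ids), k))


transcripts = [f"P{i:02d}" for i in range(1, 21)]  # hypothetical IDs P01..P20
sample = reliability_sample(transcripts)
print(len(sample))  # 4, i.e. 20% of 20 transcripts
```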

From codes to themes: the synthesis step

Coding produces 150-300 codes. Stakeholders need 4-8 findings. The collapse from one to the other is theme development — the interpretive layer that turns descriptive labels into actionable patterns.

A theme is not a code and not a category. It is a patterned response within the data that captures something important about the research question. The category “support accessibility barriers” might roll up into a broader theme of “institutional indifference” — a pattern where participants interpret operational friction as a signal that the company does not value their relationship. The theme is the interpretive jump that connects what people said to what it means for the decision the research was commissioned to inform.

Every candidate theme has to pass three tests:

  • Internal consistency. Do the excerpts grouped under this theme actually share the pattern you are claiming? Read them together end-to-end and check.
  • External distinctiveness. Is this theme meaningfully different from the other themes, or is it a restatement of the same underlying idea in different language?
  • Explanatory power. Does this theme help answer the research question? Interesting patterns that do not connect to the question are distractions, not findings.

Aim for 4-8 major themes. Fewer than four usually means the analysis is operating at too high a level of abstraction (a single mega-theme like “users want value” is not a finding). More than eight usually means the synthesis work has not pushed far enough — some “themes” are still categories that need further consolidation.

The qual-to-quant bridge: when do frequency counts belong?

The most contested methodological question in IDI analysis is when, if ever, to attach numbers to themes. “Eight of 20 participants raised X” reads like quantitative evidence and is often interpreted that way by stakeholders. The question is whether it should be.

Frequency counts are appropriate when:

  • The research question is genuinely about prevalence (“how widespread is this concern”)
  • The sample was designed to support prevalence claims (representative recruitment, defined segments, sufficient n per segment)
  • The coding was systematic enough that absence of a code is meaningful (not just an artifact of which questions came up in which conversations)

Frequency counts are inappropriate when:

  • The research question is about meaning, mechanism, or causality rather than prevalence
  • The sample is purposive or theoretical (recruited to surface specific perspectives, not to represent a population)
  • Code absence is ambiguous (a participant might hold the view but not have raised it within the interview’s time budget)

The practical rule: use frequencies for descriptive incidence claims (“X concern surfaced in roughly half of interviews”), use verbatim accounts for explanatory claims (“here is how a senior buyer described the trade-off they were making”), and never let a frequency count substitute for interpretation. The number tells you something happened; the verbatim tells you why it mattered.
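When an incidence claim is warranted, the count should be of participants who raised a code, not of coded segments, so that one talkative participant does not inflate the figure. A minimal sketch, assuming coded segments are available as (participant, code) pairs; the function name and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def participant_incidence(coded_segments: Iterable[Tuple[str, str]],
                          n_participants: int) -> Dict[str, Tuple[int, float]]:
    """Map each code to (participants who raised it, share of all participants)."""
    who: Dict[str, Set[str]] = defaultdict(set)
    for participant_id, code in coded_segments:
        who[code].add(participant_id)
    return {code: (len(ids), len(ids) / n_participants) for code, ids in who.items()}


# Hypothetical segments: P01 raises the same code twice but is counted once.
segments = [
    ("P01", "switching cost"),
    ("P01", "switching cost"),
    ("P02", "switching cost"),
    ("P03", "support wait"),
]
incidence = participant_incidence(segments, n_participants=4)
print(incidence["switching cost"])  # (2, 0.5)
print(incidence["support wait"])    # (1, 0.25)
```

The denominator is passed in explicitly because participants who never raised any of the codes still belong in it.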

When IDI findings genuinely need to be quantified at population scale — which they sometimes do, particularly for prioritization decisions — the better path is a follow-on survey that operationalizes the themes from the IDIs into measurable items. The qualitative work generates the hypotheses; the survey tests them. Counting qualitative codes is a shortcut that usually gets challenged the moment it reaches a sophisticated audience.

The historic labor cost — and what changed

For most of the last three decades, coding has been the single most expensive line item in qualitative research budgets. A senior analyst codes a 60-minute interview transcript in roughly 2-3 hours of focused work. Theme development and synthesis add another 20-40 hours on top of coding time. Reliability checks add 4-6 hours. Reporting adds 15-25 hours. For a 20-interview study, total analyst time runs 100-150 hours. At fully-loaded senior consultant rates, that is $20,000-$40,000 of pure analysis labor before any conclusions reach a stakeholder.
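The itemized figures can be added up directly. A quick sketch using only the ranges quoted above (the function name is hypothetical): for 20 interviews the listed components sum to roughly 79-131 hours, a little under the 100-150 total, so the remainder is presumably work not itemized in this paragraph, such as the initial read-through and codebook drafting. That reading is an assumption, not something the figures state.

```python
from typing import Tuple

HourRange = Tuple[int, int]  # (low, high) hours


def itemized_hours(n_interviews: int,
                   coding_per_interview: HourRange = (2, 3),
                   synthesis: HourRange = (20, 40),
                   reliability: HourRange = (4, 6),
                   reporting: HourRange = (15, 25)) -> HourRange:
    """Sum the per-interview coding time and the fixed per-study components."""
    low = n_interviews * coding_per_interview[0] + synthesis[0] + reliability[0] + reporting[0]
    high = n_interviews * coding_per_interview[1] + synthesis[1] + reliability[1] + reporting[1]
    return (low, high)


print(itemized_hours(20))  # (79, 131)
```

Coding is the only component that scales with interview count here, which is why it dominates the total as studies grow.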

This cost structure has shaped the industry’s habits. Studies got smaller (10-12 interviews became typical even when the question warranted 30). Codebooks got shallower (researchers stopped at axial coding because going further was uneconomic). Cross-study comparison stopped being attempted. The depth qualitative research is supposed to provide eroded under cost pressure.

AI-assisted coding has begun to change this. Modern AI models trained on conversational data can scan a transcript, identify likely thematic segments, propose codes against an existing codebook, and flag novel patterns that do not fit. The analyst’s role shifts from coding-from-scratch to suggestion-then-confirmation: faster, more consistent across the corpus, and freeing the analyst’s attention for the interpretive work that AI cannot do. Coding time drops by 60-80% on most studies. Reliability improves because the AI does not get fatigued at hour seven the way a human coder does.

This is not a replacement of analyst judgment. AI-assisted coding flags what to look at; the analyst still decides what it means. The misuse pattern — taking AI-suggested themes as final findings without analytical engagement — produces shelf-ware faster, but it produces shelf-ware. Used correctly, AI handles the mechanical layer of coding and gives the analyst back the time to actually interpret.

Cross-study pattern detection: where research compounds

The largest unrealized return on qualitative research is cross-study pattern detection. A team that runs 8-12 IDI studies a year accumulates 200-300 hours of conversation and tens of thousands of coded segments. Most of that material is single-use — the study ships, the deck circulates, and the transcripts vanish into a shared drive nobody opens again.

This is where qualitative research most underperforms its potential. The same code — “trust as proxy for switching cost,” “agentic anxiety,” “manager skepticism of consumer-grade tools” — that surfaces in a March product study often re-surfaces in a June pricing study and a September competitive teardown. Connecting those instances is where category-level insight lives. But connecting them requires either:

  • A consolidated codebook applied across all studies (rare in practice, because cross-study codebook reconciliation is itself a major project)
  • A persistent searchable transcript corpus with the prior coding intact (rare in practice, because transcripts get archived as PDFs that are not queryable by code)
  • A platform that does both as a workflow output (the path that has only become tractable in the last 18 months)

The shift from one-shot studies to a compounding research asset depends on this infrastructure. Without it, every study starts from zero. With it, the fourth IDI study automatically benefits from the coding work of the first three.
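At its simplest, cross-study pattern detection is a lookup of which codes recur across studies. The sketch below assumes the studies already share a reconciled codebook, so identical code names mean the same thing; the function name, study identifiers, and the recurrence threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def cross_study_patterns(study_codes: Dict[str, Set[str]],
                         min_studies: int = 3) -> Dict[str, List[str]]:
    """Return codes that appear in at least `min_studies` studies, with those studies."""
    seen_in: Dict[str, List[str]] = defaultdict(list)
    for study_id, codes in study_codes.items():
        for code in codes:
            seen_in[code].append(study_id)
    return {code: sorted(ids) for code, ids in seen_in.items() if len(ids) >= min_studies}


# Hypothetical studies using code names from the examples above.
studies = {
    "march_product": {"trust as proxy for switching cost", "agentic anxiety"},
    "june_pricing": {"trust as proxy for switching cost"},
    "september_teardown": {"trust as proxy for switching cost",
                           "manager skepticism of consumer-grade tools"},
}
patterns = cross_study_patterns(studies)
print(patterns)
# {'trust as proxy for switching cost': ['june_pricing', 'march_product', 'september_teardown']}
```

The lookup itself is trivial; the hard part is the precondition, which is why codebook reconciliation and a queryable corpus are the real bottlenecks.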

How does User Intuition handle IDI data analysis?

User Intuition runs AI-assisted coding as a built-in part of the interview workflow rather than as a separate downstream step. When transcripts return inside 24-48 hours, they arrive already segmented by topic with suggested codes attached against the study’s codebook. The analyst’s job becomes review — confirm, edit, reject, extend — rather than coding from a blank transcript, which is where 60-80% of historic analysis time went. The interpretive work that actually shapes findings still belongs to the analyst; the mechanical work of label-by-label segmentation no longer does.

The cross-study layer runs inside the Customer Intelligence Hub. Codes from prior studies persist in a searchable corpus tied back to the original transcript excerpts. When a new study surfaces a code that has appeared in three prior studies, the Hub flags it as a hub-level pattern and links the relevant excerpts together — the cross-study pattern detection that used to require a dedicated consulting engagement now shows up as a workflow output of the platform that ran the interviews. Teams running 8-12 studies a year stop losing the compounding return on their own research.

The collection-to-analysis handoff matters here. Because the in-depth interview platform applies consistent 5-7 layer laddering across every conversation, the transcripts feeding the coding step already carry the specificity that makes coding productive — the upstream quality is what makes downstream automation safe. Rich material in, structured findings out, evidence trail preserved end to end.

Bottom-line guidance

Three things determine whether IDI coding produces findings that survive review or shelf-ware that does not:

  • Codebook discipline. Stabilize the codebook by interview 8. Document every extension and merge. Run intercoder reliability on a 20% sample if the findings will face challenge.
  • Honest qual-to-quant policy. Use frequencies for incidence claims, use verbatims for explanatory claims, run a follow-on survey when prevalence at population scale actually matters.
  • Cross-study infrastructure. A single study is a finding. Twelve studies sharing a codebook and a searchable corpus is a category-level point of view. Without the infrastructure, the second study never builds on the first.

For small studies (under 15 interviews) on contained questions, careful manual coding in a spreadsheet still works and probably costs less than tooling up. For studies over 30 interviews, longitudinal programs, or any team running enough qualitative volume that cross-study comparison starts to matter, AI-assisted coding inside a persistent research platform is the path that compounds. The bottleneck has moved from coding labor to study design — which is where it should have been all along.


Note from the User Intuition Team

Human moderation, done well, is the gold standard. A skilled moderator reads silence, follows a half-thought, knows when to push and when to wait. The trouble is what that costs at scale: one moderator, one participant, one hour at a time. By the hundredth interview, even the best aren't asking the same questions they asked in the first.

User Intuition keeps what makes great moderation great — the depth, the laddering, the patient probing — and removes what holds it back. The AI moderator ladders 5–7 levels deep on every interview, with no fatigue wall and no calendar to manage. It runs hundreds of conversations in parallel, so a study fills in hours instead of weeks. Setup takes five minutes: upload your study guide and we turn it into a plan, write the screener, recruit from our 4M+ panel, and launch. Every interview is automatically scored on Length, Depth, and Coverage; if it doesn't pass, you don't pay. No refund required.

Preview a real study output before you pay; User Intuition is the only platform in the industry that lets you evaluate the work first. A 10-interview study lands at $200 in 24–48 hours. Already convinced? Sign up and try with 3 free quality interviews.

Frequently Asked Questions

What is the difference between qualitative coding and theming?

Qualitative coding is the line-by-line act of labeling segments of transcript with short descriptive tags such as “switching trigger,” “price as proxy for risk,” or “manager indifference.” Theming is the synthesis step that groups dozens or hundreds of codes into a handful of higher-order patterns that answer the research question. Codes are descriptive and granular; themes are interpretive and integrative. Skipping coding and jumping straight to themes is the most common reason qualitative findings collapse under stakeholder questioning: there is no audit trail from claim back to transcript.

When should you use inductive versus deductive coding?

Use inductive (open) coding when you are exploring an unfamiliar problem space: let codes emerge from the data so you do not impose categories the participants do not actually use. Use deductive (a priori) coding when you are testing a specific framework, comparing against a prior study, or running a longitudinal program where stable categories matter for trend detection. Most rigorous studies are hybrid: open coding on the first 5-8 transcripts to surface the participants’ own framing, then a stabilized codebook applied to the remainder.

What is intercoder reliability, and when does it matter?

Intercoder reliability is the formal check that two analysts coding the same transcripts produce consistent codes. It matters whenever the findings will face stakeholder scrutiny: board readouts, regulatory submissions, published research, vendor selection decisions. Cohen’s kappa works for two coders and one categorical dimension; Krippendorff’s alpha generalizes to multiple coders, multiple dimensions, and missing data. A commonly used floor is 0.70 for exploratory work and 0.80 for confirmatory work. Below those, your codebook needs sharper definitions or your coders need calibration training.

When do frequency counts belong in IDI findings?

Frequency counts are appropriate when the research question is genuinely about prevalence (“how often does this concern come up”) and when the sample was designed to support that claim (representative sampling, sufficient n per segment). They are inappropriate when the question is about meaning, mechanism, or causality. A single rich account of a switching decision can be more analytically important than twelve brief mentions. The rule: report frequencies for descriptive incidence claims, report verbatim accounts for explanatory claims, and never let a frequency count substitute for interpretation.

How does User Intuition handle IDI data analysis?

User Intuition runs AI-assisted coding as part of the interview workflow. Transcripts come back inside 24-48 hours already segmented by topic with suggested codes attached; the analyst confirms, edits, or rejects each one rather than coding from a blank page. Cross-study pattern detection runs inside the Customer Intelligence Hub: when the same code surfaces across a new study and three prior ones, it shows up as a hub-level pattern automatically. The qual-to-quant bridge that used to require a separate analytics engagement becomes a workflow output of the same platform that ran the interviews.