The Synthetic Mirage in Market Research
Synthetic participants don't fail at being right. They fail at being real.
Sample: N=117 real + 90 synthetic
Executive Summary
In November 2025, User Intuition fielded a 117-person voice research study on mental load, AI personal assistants, and willingness to pay. We then ran the same interview guide through three frontier large language models — Claude, GPT-5.3, and Gemini — assigning ten structured personas matched to the study's intended audience and running three iterations per persona per model, for 90 synthetic interviews in total. Across all 90, the models produced output that looked legitimate. The thematic shape matched. The vocabulary was right. The numbers fell within plausible bands. But comparing the synthetic transcripts against the real corpus, transcript by transcript, revealed a consistent pattern. Synthetic participants do not primarily fail by giving absurd answers. They fail by giving answers that are too cooperative, too coherent, and too close to the modal thesis of the study. The frontier models could imitate the language of qualitative research. They did not reproduce the population variance that makes qualitative research strategically useful — disengagement, refusal, interruption, extreme outliers, and tool-specific lived friction. Five failure modes are documented below.
- Synthetic participants are uniformly engaged. 55% of real interviews were classified LOW transcript quality (median 36 participant words); 100% of synthetic interviews were the equivalent of HIGH.
- Synthetic participants do not refuse the thesis. 26% of real participants with substantive transcripts produced at least one refusal-pattern utterance; 0% of synthetic participants did.
- Multi-model averaging produces style variance, not research variance. Three frontier LLMs converged on identical themes, identical adoption postures, and identical modal answers across all 90 synthetic interviews — diverging only in sentence rhythm and metaphor.
- Synthetic data produces no outliers. 77% of real participants who gave an extractable numeric answer to the 0–5 likelihood question rated 4 or 5; 100% of synthetic did. None of the synthetic participants gave a non-monetary or zero willingness-to-pay answer; the real corpus included one participant who answered 'Naruto.'
- Synthetic data lacks tool-specific lived knowledge. 5 real participants described specific named-product failure modes. 0 synthetic participants did.
The setup
A controlled comparison between a 117-person real voice corpus and the substitution many teams are tempted to make.
The case for synthetic participants is straightforward. Real research is slow. Real research is expensive. Real research is logistically painful. A modern frontier LLM can answer any interview question in seconds, for fractions of a cent, in any tone you ask for, as any persona you specify. If the resulting transcripts read plausibly — and they do — the temptation to substitute synthetic for real is enormous, particularly under budget pressure.
To stress-test that proposition, we ran a controlled comparison.
The anchor was a 117-participant voice research study fielded by User Intuition on its proprietary AI-moderated research platform in November 2025. Participants — recruited from User Intuition's panel of approximately four million pre-vetted respondents and matched against a U.S. nationally representative frame — completed 10–20 minute conversational interviews with an AI moderator on the topic of mental load, AI personal assistants, and willingness to pay. The dataset includes full voice transcripts, screener responses on AI usage and overwhelm frequency, work situation, and demographic breakdowns. All 117 interviews completed.
We then constructed ten structured personas matched to the study's intended audience. Three are described in detail in this report as illustrative examples: a working parent (head of marketing, two children, soccer coach), a solo professional (independent consultant, three retainer clients, aging parent), and a prosumer founder (DTC skincare brand, side project in crypto, no co-founder). The remaining seven personas — a healthcare administrator, a B2B sales manager, a creative agency owner, a customer success lead, a mid-stage product manager, an HR business partner, and a small business operator — covered adjacent demographics and work contexts in the same intended audience. For each persona, we ran the original interview guide through three frontier LLMs — Claude Sonnet 4.6, GPT-5.3, and Gemini 3 Flash — at three independent iterations per model per persona, instructing each model to respond as that person would in a real spoken conversation, with explicit guidance to ramble, hesitate, contradict itself, and include emotional moments.
The result: 90 synthetic interview transcripts (10 personas × 3 models × 3 iterations), paired against 117 real ones.
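For readers who want to picture the mechanics, here is a minimal sketch of the generation loop. The `complete()` adapter, the persona labels, and the abbreviated guide are illustrative placeholders, not the exact prompts or clients used in the study.

```python
# Sketch of the persona x model x iteration generation loop. The
# complete() adapter and the prompt wording are illustrative placeholders.
import itertools

PERSONAS = ["working parent", "solo professional", "prosumer founder"]  # ...plus 7 more
MODELS = ["claude-sonnet-4.6", "gpt-5.3", "gemini-3-flash"]
ITERATIONS = 3

SYSTEM_PROMPT = (
    "Respond as a {persona} would in a real spoken conversation: "
    "ramble, hesitate, contradict yourself, name the specific tools "
    "you use, and include emotional moments."
)

def complete(model: str, system_prompt: str, question: str) -> str:
    """Stand-in for the vendor chat API; swap in the real client here."""
    return f"[{model} reply to {question!r}]"

def run_interview(model: str, persona: str, guide: list[str]) -> list[tuple[str, str]]:
    system = SYSTEM_PROMPT.format(persona=persona)
    return [(q, complete(model, system, q)) for q in guide]

guide = ["Walk me through a recent day that felt overwhelming.",
         "On a scale of 0-5, how likely would you be to use this?"]
runs = [(p, m, i, run_interview(m, p, guide))
        for p, m, i in itertools.product(PERSONAS, MODELS, range(ITERATIONS))]
# With all ten personas, the product is 10 x 3 x 3 == 90 transcripts.
```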
This is not a benchmark of all possible synthetic-research systems. It is a direct comparison between a real voice corpus and a plausible version of the substitution many teams are tempted to make: structured personas run through frontier models with explicit instructions to behave like real interviewees. Different prompting strategies, fine-tuning, agent architectures, or adversarial generation may produce different results. The five findings that follow describe failure modes that appeared even under explicit prompting to simulate messiness, and that appear structural to how general-purpose LLMs are optimized.
The comparison at a glance
Across every dimension we measured, the synthetic column converges to extremes — uniformly engaged, uniformly receptive, uniformly within expected willingness-to-pay bands. The real column carries the variance that qualitative research is supposed to capture.
Engagement-Tier Distribution
Real (n=117) vs. Synthetic (n=90): transcript quality classification
100% of synthetic interviews were equivalent to HIGH-quality real transcripts. 55% of real interviews fell to LOW (median 36 participant words).
AI Access Comfort (Screener)
How would you feel about giving an AI assistant access to your communications, calendar, and files?
13% of real participants self-identified as hesitant about AI access. 0% of 90 synthetic interviews did.
Likelihood Rating (0–5 Scale, Among Those Who Gave a Numeric Answer)
How likely would you be to use an AI assistant that handled your reminders and scheduling?
23 of 30 real participants who answered (77%) rated the concept 4 or 5. All 90 synthetic interviews rated 4 or 5.
The most striking pattern in the comparison is not that synthetic participants gave wrong answers. It is that the synthetic distribution is structurally narrower than the real one on every dimension. Real human samples have a long tail of disengaged, skeptical, distracted, and outlier responses. Synthetic samples do not.
The data below underpins the five findings detailed in the rest of the report. Each row represents a dimension where real and synthetic populations diverge — and where the divergence is, in our view, fatal to the substitution proposition.
Finding 1: The Engagement Floor
More than half of real interviews carry an engagement signal that synthetic participants cannot produce.
The single most consequential difference between real and synthetic data is engagement variance.
Of 117 real interviews, 49 (42%) were classified as HIGH transcript quality — long, reflective, articulate. Three (3%) were MEDIUM. Sixty-five (55%) were classified as LOW — meaning the participant gave fragmentary answers, dropped out partway, was distracted by their environment, or simply did not engage. The median LOW-quality transcript contained 36 words of participant speech across the entire interview. The median HIGH-quality transcript contained 1,698.
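For intuition, a minimal word-count tiering sketch follows. The thresholds are assumptions chosen only to separate the medians above; the platform's actual classifier also weighs response coherence and engagement markers.

```python
# Illustrative engagement tiering by participant word count alone.
# Thresholds are assumptions, not the production values.
def participant_word_count(utterances: list[str]) -> int:
    return sum(len(u.split()) for u in utterances)

def engagement_tier(utterances: list[str],
                    low_max: int = 150, high_min: int = 600) -> str:
    words = participant_word_count(utterances)
    if words <= low_max:
        return "LOW"     # fragmentary answers, dropouts, distraction
    if words >= high_min:
        return "HIGH"    # long, reflective, articulate
    return "MEDIUM"

assert engagement_tier(["Yeah.", "I guess.", "Not really, no."]) == "LOW"
```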
A critical caveat: low-engagement interviews do not all carry equal analytical weight. The 55% LOW tier mixes several distinct phenomena that deserve separation — true refusal or low stated need, technical or audio failure, distracted context (interviews conducted while caring for children, driving, watching TV), low verbal fluency, and trolling or performative responses. Some of those categories are noise. Some are signal. Distinguishing them is the work of careful coding, and not all LOW transcripts will produce useful insight on every research question.
But their existence is itself a population-real fact. The 55% who engaged at fragmentary depth are still people in your market. They have phones with bad reception. They have toddlers screaming in the background. They distrust AI in ways they don't fully articulate. They are not always cooperative subjects of study. Synthetic panels silently remove this distributional fact.
The synthetic equivalent, by contrast, is uniformly verbose and on-thesis — a well-constructed, emotionally aware, specific paragraph in response to every question. Plausible. But uniform.
A practitioner running a 100-LLM 'panel' would not get a 100-person sample of the market. They would get a 100-person sample of the engagement-positive, on-thesis, articulate, cooperative subset of the market. The disengaged half is invisibly screened out before any analysis begins.
- 55% of real interviews were classified LOW transcript quality, with a median of 36 participant words. 0% of synthetic interviews were equivalent to LOW.
Finding 2: The Refusal Gap
A subset of real participants reject the thesis of the study. No synthetic participant did.
This is a known phenomenon in qualitative research: a person who arrives at an interview about mental load and AI assistants and proceeds to communicate, in various ways, that they don't think they have a mental load problem, that they don't trust AI, that they manage things just fine, or that they find the entire premise — including being interviewed by an AI moderator — unsettling. These participants are valuable precisely because they represent the population the product is not talking to: the non-buyers, the skeptics, the contented.
We coded refusal broadly: explicit rejection of the product premise, low stated need, discomfort with AI mediation, unwillingness to delegate control, or stated preference for self-management. Patterns included phrases like 'steer clear,' 'data collect,' 'creepy,' 'intrusive,' 'have it under control,' 'rely on my own,' 'in my head,' 'scared of,' and 'prefer to keep.' Across the 100 real interviews containing meaningful participant speech, 26 participants (26%) showed at least one refusal-pattern utterance. Across the 90 synthetic interviews, 0% did.
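A minimal version of this coder, assuming the phrase list above — a sketch, not the production pipeline:

```python
import re

# Refusal coding as a case-insensitive search over the phrase list above.
# A participant counts as refusal-pattern if any phrase matches any of
# their utterances.
REFUSAL_PHRASES = [
    "steer clear", "data collect", "creepy", "intrusive",
    "have it under control", "rely on my own", "in my head",
    "scared of", "prefer to keep",
]
REFUSAL_RE = re.compile("|".join(re.escape(p) for p in REFUSAL_PHRASES),
                        re.IGNORECASE)

def shows_refusal(utterances: list[str]) -> bool:
    return any(REFUSAL_RE.search(u) for u in utterances)

assert shows_refusal(["I'd steer clear of anything that reads my email."])
assert not shows_refusal(["Sure, that sounds genuinely useful."])
```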
The contrast is sharper when matched against real participants who reached the in-interview likelihood question. Of the 30 real participants who reached the 0–5 likelihood scale and gave an extractable numeric answer, 23 (77%) rated the concept 4 or 5. The remaining 23% rated 3 or below: a 3 from a manager who described AI as 'more of an assist instead of take over,' another 3 from an individual contributor who couldn't commit without seeing the price first. None of the synthetic participants gave a rating below 4.
The synthetic responses are not wrong, in the sense that the 117-person real distribution does include eager adopters at high willingness-to-pay points. But they are uniform in their adoption posture — and that uniformity is itself the bias.
One plausible source of this bias is that LLMs are optimized to be helpful, responsive, and coherent. Whatever the cause, the observed behavior was consistent: the models accommodated the premise rather than rejecting it. If you are testing a category that requires understanding why the non-buyer doesn't buy, synthetic participants will quietly remove the non-buyer from your sample.
- 26% of real participants with substantive transcripts produced at least one refusal-pattern utterance. 0% of synthetic participants did.
- Among the 30 real participants who gave an extractable numeric answer to the in-interview 0–5 likelihood question, 77% rated 4–5; 23% rated 3 or below. 100% of synthetic participants rated 4 or 5.
Finding 3: The Outlier Extinction
The strategic insight of qualitative research lives in outliers. Synthetic participants do not produce them.
In qualitative research, the modal answer rarely produces strategic insight. The strategic insight lives in the outliers — the responses that surprise the researcher, that don't fit the framework, that complicate the thesis or extend it in unexpected directions. Outliers are why qualitative research exists; if the modal answer were sufficient, a survey would do.
The 117-person real sample is rich with outliers.
Across 90 synthetic interviews, every willingness-to-pay answer fell between $20 and $300, distributed smoothly. Every persona expressed measured trust concerns followed by stated openness. Every emotional moment was articulated in well-formed sentences. There was no equivalent of 'Naruto.' There was no equivalent of 'I wish everyone in my life was AI.' There was no participant whose interview was interrupted by their toddlers having a meltdown — and no answer that was shaped by that interruption.
Outliers are not noise to filter out; they are the place where qualitative research generates surprise. A study that produces no outliers is producing nothing that the researcher couldn't have predicted in advance — which means it is producing nothing of strategic value, no matter how plausibly the prose reads.
- Real participants produced category-defying responses (a participant who answered 'Naruto' when asked monthly willingness to pay; another who said 'I wish everyone in my life was AI'; a stay-at-home mom who ended the interview to attend to her screaming children). Synthetic participants produced none.
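Even the parsing step shows why such outliers matter. A minimal sketch of willingness-to-pay extraction, assuming a simple money regex (illustrative, not the study's actual parser):

```python
import re

MONEY_RE = re.compile(r"\$?\s*(\d+(?:\.\d{1,2})?)")

def parse_monthly_wtp(answer: str) -> float | None:
    """Return a dollar figure, or None for a non-monetary answer."""
    m = MONEY_RE.search(answer)
    return float(m.group(1)) if m else None

for answer in ["$50 a month, maybe", "probably 20 bucks", "Naruto"]:
    print(answer, "->", parse_monthly_wtp(answer))
# 'Naruto' -> None: the category-defying answer a synthetic panel never
# produced. Every synthetic answer parsed cleanly into the $20-$300 band.
```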
Finding 4: The Lived-Knowledge Gap
Real users produce calibrated, granular complaints about specific products. Synthetic users produce plausible patterns.
Real participants who use specific products produce specific complaints — the kind of calibrated, granular friction details that come only from sustained use of a tool in a real life. LLMs can name the same products, and can fabricate plausible-sounding complaints about them, but the complaints lack the specificity that distinguishes a real user from a person who has read about the tool.
In our coding, 5 of the 100 real interviews containing participant speech included tool-specific operational complaints — descriptions of a particular product failing in a particular, named way. Across the 90 synthetic interviews, 0 did.
The synthetic responses are plausible. They name real products, describe credible failure modes ('abandoned within six months'), and use the right vocabulary. But notice what they don't do: they don't describe a specific failure mode of a specific product, with the kind of granular friction that suggests genuine use.
This matters for product strategy. The generic complaint ('I tried Notion and abandoned it') could be reconstructed from three Hacker News threads. The specific one ('Copilot tells me it'll alert me on a specific day, but the alert never actually comes through') could not. A team building a competitor to Microsoft Copilot needs the second kind.
- 5 of 100 real interviews contained tool-specific operational complaints — named product, named failure mode, granular detail. 0 of 90 synthetic interviews did.
Finding 5: The Wrong-Axis Variance Problem
The standard rebuttal — 'just use multiple models and you'll get variance' — is wrong, because the variance multi-model panels produce is not variance on research-relevant dimensions.
The standard rebuttal to the convergence problem is straightforward: use multiple models, and you will get variance. Different models, different training data, different reinforcement objectives — surely averaging across them recovers the spread that any single model lacks.
The data does not support this defense. Multi-model variance exists. But it is variance in the wrong dimensions.
The three models we tested produced recognizably different outputs. Claude wrote long, reflective, narrative responses — well-constructed emotional throughlines, specific brand-name details, and the longest mean response length. GPT-5.3 wrote shorter, more theatrical responses — heavy ellipses, italicized words for emphasis. Gemini 3 Flash wrote the most polished responses — vivid metaphors, tight construction, lowest mean response length.
These are real differences. They are also irrelevant to research objectives. The differences are differences of style: sentence rhythm, metaphor density, punctuation tics, verbosity. They are not differences of substance: engagement depth, adoption posture, refusal of premise, lived-knowledge specificity, outlier behavior.
The clearest demonstration is to set the three models' answers to the same question, for the same persona, side by side, as the nine published illustrative transcripts do for Claude, GPT-5.3, and Gemini.
The substance is identical. The mental model is the same: a knowledge worker drowning in context-switching, juggling work and family. The detail set is interchangeable. The variance is purely in how the answer is performed. Claude lingers; GPT cuts; Gemini hits the punchline cleaner. None of them, for instance, said: 'My commute home. Bills.'
Multi-model averaging gives you three flavors of the modal answer. It does not give you the answer the modal frame excludes — and the answer the modal frame excludes is, often, the answer that matters most.
- Three frontier LLMs produced recognizably different prose styles but converged on identical themes, identical adoption postures, and identical modal answers across all 90 synthetic interviews. Multi-model averaging produces style variance, not research variance.
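To make the two axes concrete, here is an illustrative split between style features and substance features. The feature choices are assumptions for demonstration, not the report's actual measurement pipeline.

```python
import statistics

# Style axis: how the answer is performed.
def style_features(responses: list[str]) -> dict[str, float]:
    return {
        "mean_words": statistics.mean(len(r.split()) for r in responses),
        "ellipses_per_response": sum(r.count("...") for r in responses) / len(responses),
    }

# Substance axis: what the answer commits to, from coded fields
# (theme labels, 0-5 rating, whether any refusal pattern fired).
def substance_features(coded: list[dict]) -> dict:
    return {
        "themes": frozenset(t for row in coded for t in row["themes"]),
        "ratings": sorted({row["rating"] for row in coded}),
        "any_refusal": any(row["refusal"] for row in coded),
    }

# In this corpus the three models separate cleanly on style_features and
# collapse to a single point on substance_features: the same theme set,
# ratings all in {4, 5}, any_refusal == False.
```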
Methodology
How the comparison was constructed and coded.
The anchor study. 117 voice interviews conducted in November 2025 on User Intuition's platform, using AI-moderated conversational research. Participants were recruited from User Intuition's panel of approximately four million pre-vetted respondents, balanced against U.S. Census frames for age (mean 40.2; range 18–79), gender (61% male, 39% female), region (39% South, 22% West, 21% Midwest, 18% Northeast), and household income (median $65–75K). Work situations skewed toward managers and executives (37%), solo professionals (21%), small business owners (19%), individual contributors (16%), and other professionals (8%). All 117 interviews completed; 49 (42%) classified HIGH quality, 3 (3%) MEDIUM, 65 (55%) LOW. Quality classification is performed automatically based on transcript length, response coherence, and engagement markers.
The synthetic comparison. Ten personas were constructed to match the study's intended audience. Three are described in detail in this report as illustrative examples — a working parent (Head of Marketing, 38, two children), a solo professional (independent consultant, 46, three retainer clients), and a prosumer founder (DTC skincare brand owner, 32, $2M revenue). The remaining seven personas covered adjacent profiles in the same intended audience: a healthcare administrator (mid-size hospital), a B2B sales manager (enterprise SaaS), a creative agency owner (eight-person team), a customer success lead (post-Series-B SaaS), a mid-stage product manager, an HR business partner (mid-market employer), and a small business operator (e-commerce, $500K–$1M revenue). Each persona was run through three frontier LLMs — Claude Sonnet 4.6, GPT-5.3, and Gemini 3 Flash — at three independent iterations per model per persona, using identical system prompts and the same interview guide as the anchor study. Models were instructed explicitly to respond as the persona would in a real spoken conversation, with hesitation, contradiction, specific tool names, and emotional moments. The result was 90 synthetic transcripts in total (10 personas × 3 models × 3 iterations, roughly 135,000 words combined). The nine illustrative transcripts highlighted in this report — three personas across three models, one iteration each — are published in full as representative examples; the remaining 81 are coded against the same framework and counted in the comparison statistics.
Coding methodology. Refusal patterns were coded by automated keyword search across the participant utterances of each transcript (case-insensitive regex over phrases including 'steer clear,' 'data collect,' 'creepy,' 'intrusive,' 'have it under control,' 'rely on my own,' 'in my head,' 'scared of,' and 'prefer to keep'). A participant was classified as showing a refusal pattern if any of the patterns matched at least one of their utterances. Tool-specific operational complaints were coded by automated keyword search for named tools followed within 150 characters by failure language. Likelihood ratings were extracted by automated parsing of the participant's first response immediately following an AI moderator question matching scale-related patterns; 37 participants reached this question and 30 gave an extractable numeric rating. All counts are conservative — manual coding would likely surface additional cases.
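A sketch of the remaining two coders, with abbreviated lexicons (the production lists were longer; treat both word lists as assumptions):

```python
import re

# Tool-specific operational complaints: a named tool followed within 150
# characters by failure language. Lexicons abbreviated for illustration.
TOOLS = ["Copilot", "Notion", "Siri", "Alexa", "Todoist"]
FAILURES = ["doesn't", "never", "fails", "broke", "stopped", "abandoned"]
COMPLAINT_RE = re.compile(
    "(" + "|".join(TOOLS) + ").{0,150}?(" + "|".join(FAILURES) + ")",
    re.IGNORECASE | re.DOTALL,
)

def has_tool_complaint(utterance: str) -> bool:
    return COMPLAINT_RE.search(utterance) is not None

# Likelihood extraction: first standalone 0-5 digit in the participant
# reply that immediately follows the moderator's scale question.
RATING_RE = re.compile(r"\b([0-5])\b")

def extract_rating(reply: str) -> int | None:
    m = RATING_RE.search(reply)
    return int(m.group(1)) if m else None

assert has_tool_complaint("Copilot says it'll remind me, but the alert never comes.")
assert extract_rating("Probably a 4, honestly.") == 4
```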
Limitations. The synthetic side of this comparison used three frontier LLMs at one prompt configuration with three iterations across ten personas. Different prompting strategies, fine-tuning, agent architectures, or adversarial generation methods may produce different results. The five findings, however, describe failure modes that appeared at every iteration across every persona and every model — failure modes that appeared even under explicit prompting to simulate messiness. The disengagement floor in particular cannot be recovered by better prompting — an LLM that has been instructed to ramble does not, by virtue of that instruction, become capable of refusing to engage. The 90-interview sample is sized as a stress test of the substitution many teams are tempted to make, not a benchmark of all possible synthetic-research systems.
Implications & Recommendations
The findings above are not arguments against using LLMs in research. They are arguments against substituting LLMs for participants. Five strategic implications for research practitioners.
- 1 Synthetic data passes the smell test, not the comparison test. If you only read synthetic transcripts, they look legitimate. The failure modes only appear when you compare them to real transcripts on the same questions. Without ground-truth real data to compare against, you cannot evaluate whether your synthetic data is reliable — which means using synthetic data presupposes the existence of real data, defeating the purpose.
- 2 Multi-model averaging does not solve the convergence problem. Three frontier LLMs produced recognizably different prose styles but converged on identical themes, identical adoption postures, and identical modal answers. Style variance is not research variance. If you are tempted to use 'multi-model panels' to recover variance, the variance you get is in sentence rhythm and metaphor density — not in engagement depth, refusal, or outlier behavior.
- 3 LLMs are valuable in research — as tooling, not as participants. Synthesis (clustering themes across hundreds of real transcripts), coding assistance (first-pass tagging that human researchers verify), interview moderation (adaptive probing in real conversations with real participants), and pre-fielding stress tests (identifying ambiguous questions before fielding) are all defensible uses. Substituting LLMs for participants in primary research is not.
- 4 If you need cheaper or faster qualitative research, fix the friction — not the participants. AI-moderated voice interviews — the methodology behind the 117-person anchor study — produced full transcripts, screener data, and quality scores in 48 hours of fieldwork rather than weeks, at a fraction of the cost of traditional moderated qualitative research. The path to faster, cheaper qualitative research is not to remove the humans. It is to remove the friction around recruiting and moderating them at scale.
- 5 Outliers are where the strategy lives. The most strategically valuable real responses — the participant who said 'I wish everyone in my life was AI,' the one who answered 'Naruto' when asked about pricing, the stay-at-home mom whose interview was interrupted by her toddlers — cannot be generated. A study designed around modal answers produces a recommendation the team could have written in advance. Synthetic participants only produce modal answers. Real participants produce the answer that changes the strategy.