The Benchmark Trap
Every team that runs a concept test eventually asks the same question: “Is this score good?”
It is the wrong question, but it is understandable. Teams want a number that tells them whether to proceed or kill the concept. They want a threshold that removes ambiguity. And an entire industry of research vendors has been happy to provide one, usually in the form of normative databases that promise to tell you whether your 42% top-2-box purchase intent is above or below the “norm.”
The problem is that these benchmarks are far less reliable than they appear, and over-reliance on them leads to worse decisions than having no benchmark at all.
Why Top-2-Box Scores Mislead
Top-2-box (T2B) scoring takes the percentage of respondents who selected the top two options on a scale (typically “definitely would buy” and “probably would buy”) and reports it as a single number. It is the default metric in concept testing, and it obscures more than it reveals.
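To make the mechanics concrete, here is a minimal sketch of how a T2B score is computed from raw scale responses. The function name and the assumption of a 5-point scale are illustrative, not a standard API:

```python
from collections import Counter

def top_2_box(responses: list[int], scale_max: int = 5) -> float:
    """Share of respondents (in percent) choosing the top two scale points.

    `responses` holds one answer per respondent on a 1..scale_max
    purchase-intent scale (on the typical 5-point scale,
    5 = "definitely would buy", 4 = "probably would buy").
    """
    if not responses:
        raise ValueError("no responses")
    counts = Counter(responses)
    top2 = counts[scale_max] + counts[scale_max - 1]
    return 100 * top2 / len(responses)

# 10 respondents: three 5s and two 4s land in the top two boxes
sample = [5, 5, 5, 4, 4, 3, 3, 2, 2, 1]
print(f"{top_2_box(sample):.1f}% T2B")  # -> 50.0% T2B
```

Note how much the single number throws away: the same 50% could come from five enthusiastic 5s or five lukewarm 4s, which is exactly the scale-interpretation problem described below.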
The acquiescence problem. Survey respondents skew positive. In most concept tests, “probably would buy” captures a mix of genuine interest and polite non-rejection. The gap between “probably would buy” on a survey and actual purchase behavior is enormous, and it varies by category, culture, and how the question is framed.
Scale interpretation varies. A “4 out of 5” means different things to different people. Some respondents reserve 5 for concepts they find extraordinary. Others use 5 as their default for anything acceptable. Aggregating these into a single T2B number treats fundamentally different response patterns as equivalent.
Context dependence. The same concept will score differently depending on:
- How it is presented (description vs. visual board vs. prototype)
- What concepts were shown before it (contrast effects)
- Whether the respondent was primed with a problem statement
- The specificity of the target audience screening
A T2B score of 45% from a well-screened audience seeing a polished concept board is not comparable to 45% from a loosely screened panel seeing a text description.
Industry Benchmark Ranges
With those caveats firmly in place, here are directional benchmark ranges that research teams commonly reference. These are approximate and should not be used as go/no-go thresholds.
| Category | T2B Purchase Intent Range | Notes |
|---|---|---|
| CPG / FMCG | 30-55% | Highly variable by subcategory; impulse categories skew higher |
| Consumer Tech | 20-40% | Lower because purchase requires more consideration |
| B2B / SaaS | 15-35% | Multi-stakeholder buying suppresses individual intent scores |
| Healthcare / Pharma | 20-40% | Regulated claims limit what stimulus can promise |
| Financial Services | 15-30% | Trust and switching costs dampen stated intent |
| Food & Beverage | 35-60% | Highest T2B because trial cost is low |
These ranges come from aggregated industry data and vary significantly by source. A concept scoring 25% T2B in B2B SaaS might be a strong result, while 25% in food and beverage would be cause for concern.
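A quick sketch of why category context matters when reading a score. The range values are copied from the table above and are directional only; the category keys and function are illustrative, not a real benchmark service:

```python
# Directional T2B ranges (percent) from the table above.
# These are approximate and must not be used as go/no-go thresholds.
T2B_RANGES = {
    "cpg": (30, 55),
    "consumer_tech": (20, 40),
    "b2b_saas": (15, 35),
    "healthcare": (20, 40),
    "financial_services": (15, 30),
    "food_beverage": (35, 60),
}

def position_in_range(category: str, t2b: float) -> str:
    """Place a T2B score relative to its category's typical range."""
    lo, hi = T2B_RANGES[category]
    if t2b < lo:
        return "below the typical range"
    if t2b > hi:
        return "above the typical range"
    return "within the typical range"

# The same 25% reads very differently by category:
print(position_in_range("b2b_saas", 25))       # within the typical range
print(position_in_range("food_beverage", 25))  # below the typical range
```

The usage example mirrors the point above: 25% is a solid mid-range result in B2B SaaS and a warning sign in food and beverage.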
The Problem with Normative Databases
Large research firms maintain normative databases: collections of historical concept test results that serve as benchmarks. In theory, you test your concept, compare it to the database, and know whether you are above or below average.
In practice, normative databases have serious limitations:
Historical bias. Norms are built from past tests. If the database is dominated by concepts from 2015-2020, it reflects consumer expectations and competitive contexts that no longer exist. Categories evolve. What counted as innovative five years ago may be table stakes today.
Category mismatch. Your sparkling prebiotic beverage gets compared to a database “norm” that includes energy drinks, juice, and flavored water. The categories are different enough that the comparison is noise, not signal.
Methodological inconsistency. Were those historical tests run with the same scales? The same stimulus format? The same screening criteria? Usually not. Comparing your carefully designed test against a grab-bag of historical methodologies introduces error that the neat percentile ranking obscures.
Survivorship and selection effects. Normative databases over-represent concepts from large companies that test frequently. They under-represent scrappy startups and innovative concepts that break category conventions. Benchmarking against this skewed sample can penalize genuinely novel ideas.
The false precision problem. “Your concept scored in the 62nd percentile” implies a level of precision that the underlying data does not support. The confidence interval around that ranking is wide enough to make the difference between “above average” and “below average” statistically meaningless in many cases.
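The width of the uncertainty is easy to see with a standard confidence interval on the score itself, before any ranking error is layered on top. This sketch uses the Wilson score interval for a proportion; the sample size is a hypothetical:

```python
from math import sqrt

def wilson_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed proportion p from n respondents.

    z = 1.96 gives an approximate 95% interval.
    """
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# A 45% T2B score from a hypothetical 150-respondent test
lo, hi = wilson_interval(0.45, 150)
print(f"45% T2B, n=150 -> 95% CI roughly {lo:.0%} to {hi:.0%}")
```

With 150 respondents, the interval spans roughly 37% to 53%: wide enough to cover most of a category's “typical range,” which is why a precise-sounding percentile rank built on top of it overstates what the data can support.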
Building Internal Benchmarks
The most reliable benchmarks come from your own testing history. Internal benchmarks control for the variables that make external norms unreliable: your methodology, your audience, your stimulus format, your category.
How to Build Them
Step 1: Standardize your methodology. Use the same scales, the same stimulus format, and the same screening criteria across tests. Methodological consistency is the foundation. Without it, you are comparing apples to oranges even within your own data.
Step 2: Track every concept tested. Record not just the scores but the context: what stage was the concept at, what was the stimulus fidelity, how was the audience screened. This metadata matters for interpretation.
Step 3: Build your distribution. After 8-10 tests, plot your T2B scores. You will see your own range. After 20+, you will have a meaningful distribution with clear quartiles.
Step 4: Correlate with outcomes. This is where internal benchmarks become genuinely powerful. Track which concepts went to market and how they performed. Over time, you can identify the score ranges that predict commercial success in your specific context.
Step 5: Segment your norms. As your database grows, build separate benchmarks for different concept types (new products vs. line extensions), different audiences, and different stimulus formats.
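The steps above can be sketched as a simple tracking structure: record each test with its metadata, then compute your own quartiles once the history is large enough. The field names and the toy eight-test history are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class ConceptTest:
    # Track whatever metadata your team standardizes on (Step 2).
    name: str
    t2b: float       # top-2-box purchase intent, percent
    stage: str       # e.g. "early idea", "refined concept"
    stimulus: str    # e.g. "text", "concept board", "prototype"
    launched: bool = False

# Hypothetical history after eight standardized tests (Step 3 territory)
history = [
    ConceptTest("A", 28, "early idea", "text"),
    ConceptTest("B", 41, "refined concept", "concept board"),
    ConceptTest("C", 35, "refined concept", "concept board"),
    ConceptTest("D", 52, "refined concept", "prototype"),
    ConceptTest("E", 33, "early idea", "text"),
    ConceptTest("F", 46, "refined concept", "concept board"),
    ConceptTest("G", 38, "refined concept", "concept board"),
    ConceptTest("H", 30, "early idea", "text"),
]

scores = sorted(t.t2b for t in history)
q1, median, q3 = quantiles(scores, n=4)  # internal quartiles
print(f"Internal norms: Q1={q1:.1f}, median={median:.1f}, Q3={q3:.1f}")

# Rank a new concept against your own history, not an external norm
new_score = 44
share_below = sum(s < new_score for s in scores) / len(scores)
print(f"New concept at {new_score}% beats {share_below:.0%} of our history")
```

Segmenting the norms (Step 5) is then a matter of filtering `history` by `stage` or `stimulus` before computing quantiles, so that prototypes are only compared against other prototypes.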
The Timeline
Internal benchmarks are not instant. Here is a realistic ramp:
| Tests Completed | Benchmark Reliability |
|---|---|
| 1-5 | No benchmark; rely on qualitative signals |
| 6-10 | Rough directional range; you know what “high” and “low” look like |
| 11-20 | Meaningful quartiles; can rank new concepts against history |
| 20+ | Robust internal norms with outcome correlation |
At $20 per interview, building this testing history is within economic reach for teams of any size. A continuous testing program that evaluates 2-3 concepts per month produces a robust internal norm within a year.
When Benchmarks Help
Benchmarks are useful in specific situations:
- Comparing concepts against each other. Relative ranking within the same test is always more reliable than comparing against an external norm.
- Tracking over time. If your concepts are consistently improving (or declining) against your internal baseline, that trend is meaningful.
- Setting minimum thresholds. After enough tests with outcome data, you can identify the floor below which concepts rarely succeed in-market. This is a useful kill criterion.
- Communicating to stakeholders. Executives want a number. A percentile ranking against your internal database is a defensible way to provide one.
When Benchmarks Mislead
Benchmarks actively harm decision-making when:
- They replace qualitative understanding. A concept that scores 35% T2B but generates passionate enthusiasm from a specific segment is often a better bet than one scoring 45% with lukewarm, undifferentiated appeal. The number does not capture this.
- They become a bureaucratic gate. “Must score above the 60th percentile” as a rigid policy kills innovative concepts that challenge category conventions and initially confuse respondents.
- They ignore the “why.” Two concepts can score identically for entirely different reasons. One resonates on the core benefit but confuses on pricing. The other is clear on pricing but has a weak benefit. The right next step is completely different, and the benchmark tells you nothing.
- They create false confidence. A strong benchmark score does not mean the concept will succeed. Execution, pricing, distribution, timing, and competitive response all matter more than a pre-launch score.
Why Qualitative Depth Matters More Than the Number
The most valuable output of a concept test is not the score. It is the understanding of why people respond the way they do.
A 30-minute depth interview with 5-7 levels of probing reveals the reasoning, emotional response, comparisons, objections, and conditions under which a respondent would or would not engage with a concept. This qualitative depth tells you what to fix, what to amplify, and what to abandon in a way that no aggregate score can.
This is where AI-moderated concept testing changes the equation. Traditional quant-first concept testing gives you the score and leaves you guessing about the why. Depth interviews at scale give you both, with the qualitative insight to act on what the numbers mean.
A Better Framework
Instead of asking “Is this score good?”, ask:
- Is this concept better than the alternatives we tested? Relative comparison within a well-designed test is more trustworthy than any external benchmark.
- Do respondents understand and articulate the core value proposition? If they can play it back in their own words, the concept communicates clearly.
- What are the specific objections, and are they fixable? The qualitative data reveals whether low scores reflect fundamental concept weakness or execution issues that can be addressed.
- Does a specific segment respond strongly? A concept that polarizes (loved by some, rejected by others) may have a viable niche even if the aggregate score is mediocre.
- How does this compare to our own testing history? Internal benchmarks, built over time, are the most reliable comparison available.
For guidance on designing concept tests that produce both quantitative and qualitative insight, see the complete guide to concept testing. For crafting questions that surface the “why” behind the scores, see concept testing questions.