Reference Deep-Dive · 7 min read

Concept Testing Benchmarks: What Counts as a 'Good' Score

By Kevin, Founder & CEO

The Benchmark Trap


Every team that runs a concept test eventually asks the same question: “Is this score good?”

It is the wrong question, but it is understandable. Teams want a number that tells them whether to go or to kill. They want a threshold that removes ambiguity. And an entire industry of research vendors has been happy to provide one, usually in the form of normative databases that promise to tell you whether your 42% top-2-box purchase intent is above or below the “norm.”

The problem is that these benchmarks are far less reliable than they appear, and over-reliance on them leads to worse decisions than having no benchmark at all.

Why Top-2-Box Scores Mislead


Top-2-box (T2B) scoring takes the percentage of respondents who selected the top two options on a scale (typically “definitely would buy” and “probably would buy”) and reports it as a single number. It is the default metric in concept testing, and it obscures more than it reveals.
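
As a concrete illustration, here is a minimal sketch of the arithmetic, using a hypothetical 5-point scale and made-up responses rather than data from any real study:

```python
from collections import Counter

# Hypothetical 5-point purchase-intent scale: 1 = "definitely would not buy"
# through 5 = "definitely would buy". T2B counts the 4s and 5s.
responses = [5, 4, 3, 4, 2, 5, 3, 4, 1, 4, 5, 3, 2, 4, 3]  # made-up example data

def top_2_box(scores, top_values=(4, 5)):
    """Share of respondents who chose one of the top two scale points."""
    counts = Counter(scores)
    return sum(counts[v] for v in top_values) / len(scores)

print(f"T2B purchase intent: {top_2_box(responses):.0%}")  # -> 53%
```

The collapse is the point: the 4s and the 5s are pooled into a single number, which is exactly where the problems below begin.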

The acquiescence problem. Survey respondents skew positive. In most concept tests, “probably would buy” captures a mix of genuine interest and polite non-rejection. The gap between “probably would buy” on a survey and actual purchase behavior is enormous, and it varies by category, culture, and how the question is framed.

Scale interpretation varies. A “4 out of 5” means different things to different people. Some respondents reserve 5 for concepts they find extraordinary. Others use 5 as their default for anything acceptable. Aggregating these into a single T2B number treats fundamentally different response patterns as equivalent.

Context dependence. The same concept will score differently depending on:

  • How it is presented (description vs. visual board vs. prototype)
  • What concepts were shown before it (contrast effects)
  • Whether the respondent was primed with a problem statement
  • The specificity of the target audience screening

A T2B score of 45% from a well-screened audience seeing a polished concept board is not comparable to 45% from a loosely screened panel seeing a text description.

Industry Benchmark Ranges


With those caveats firmly in place, here are directional benchmark ranges that research teams commonly reference. These are approximate and should not be used as go/no-go thresholds.

Category | T2B Purchase Intent Range | Notes
CPG / FMCG | 30-55% | Highly variable by subcategory; impulse categories skew higher
Consumer Tech | 20-40% | Lower because purchase requires more consideration
B2B / SaaS | 15-35% | Multi-stakeholder buying suppresses individual intent scores
Healthcare / Pharma | 20-40% | Regulated claims limit what stimulus can promise
Financial Services | 15-30% | Trust and switching costs dampen stated intent
Food & Beverage | 35-60% | Highest T2B because trial cost is low

These ranges come from aggregated industry data and vary significantly by source. A concept scoring 25% T2B in B2B SaaS might be a strong result, while 25% in food and beverage would be cause for concern.

The Problem with Normative Databases


Large research firms maintain normative databases: collections of historical concept test results that serve as benchmarks. In theory, you test your concept, compare it to the database, and know whether you are above or below average.

In practice, normative databases have serious limitations:

Historical bias. Norms are built from past tests. If the database is dominated by concepts from 2015-2020, it reflects consumer expectations and competitive contexts that no longer exist. Categories evolve. What counted as innovative five years ago may be table stakes today.

Category mismatch. Your sparkling prebiotic beverage gets compared to a database “norm” that includes energy drinks, juice, and flavored water. The categories are different enough that the comparison is noise, not signal.

Methodological inconsistency. Were those historical tests run with the same scales? The same stimulus format? The same screening criteria? Usually not. Comparing your carefully designed test against a grab-bag of historical methodologies introduces error that the neat percentile ranking obscures.

Survivorship and selection effects. Normative databases over-represent concepts from large companies that test frequently. They under-represent scrappy startups and innovative concepts that break category conventions. Benchmarking against this skewed sample can penalize genuinely novel ideas.

The false precision problem. “Your concept scored in the 62nd percentile” implies a level of precision that the underlying data does not support. The confidence interval around that ranking is wide enough to make the difference between “above average” and “below average” statistically meaningless in many cases.
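
To make the false-precision point concrete, here is a minimal sketch of a 95% Wilson interval around a hypothetical T2B score; the sample size of 150 and the 42% result are illustrative assumptions, not figures from any normative database.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion such as a T2B share."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# Hypothetical test: 63 of 150 respondents in the top two boxes (42% T2B).
low, high = wilson_interval(63 / 150, 150)
print(f"42% T2B with n=150 -> 95% CI roughly {low:.0%} to {high:.0%}")  # ~34% to 50%
```

A score that could plausibly sit anywhere between the mid-30s and 50% can land in very different parts of a normative distribution, which is why a single-point percentile ranking overstates what the data supports.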

Building Internal Benchmarks


The most reliable benchmarks come from your own testing history. Internal benchmarks control for the variables that make external norms unreliable: your methodology, your audience, your stimulus format, your category.

How to Build Them

Step 1: Standardize your methodology. Use the same scales, the same stimulus format, and the same screening criteria across tests. Methodological consistency is the foundation. Without it, you are comparing apples to oranges even within your own data.

Step 2: Track every concept tested. Record not just the scores but the context: what stage was the concept at, what was the stimulus fidelity, how was the audience screened. This metadata matters for interpretation.

Step 3: Build your distribution. After 8-10 tests, plot your T2B scores. You will see your own range. After 20+, you will have a meaningful distribution with clear quartiles.

Step 4: Correlate with outcomes. This is where internal benchmarks become genuinely powerful. Track which concepts went to market and how they performed. Over time, you can identify the score ranges that predict commercial success in your specific context.

Step 5: Segment your norms. As your database grows, build separate benchmarks for different concept types (new products vs. line extensions), different audiences, and different stimulus formats.
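
To make Steps 2 through 5 concrete, here is a minimal sketch of one way to track standardized test records and rank a new concept against your own history; the record fields and scores are hypothetical, not a prescribed schema.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class ConceptTest:
    name: str
    t2b: float               # top-2-box purchase intent, 0-1
    concept_type: str        # e.g. "new product" vs. "line extension"
    stimulus: str            # e.g. "concept board", "text description"
    launched: bool = False
    succeeded: bool = False  # filled in later, once in-market results exist

# Hypothetical testing history, all run on one standardized methodology.
history = [
    ConceptTest("A", 0.31, "new product", "concept board"),
    ConceptTest("B", 0.44, "line extension", "concept board"),
    ConceptTest("C", 0.27, "new product", "concept board"),
    ConceptTest("D", 0.52, "line extension", "concept board"),
    ConceptTest("E", 0.38, "new product", "concept board"),
    ConceptTest("F", 0.41, "new product", "concept board"),
    ConceptTest("G", 0.35, "line extension", "concept board"),
    ConceptTest("H", 0.48, "new product", "concept board"),
]

scores = sorted(t.t2b for t in history)
q1, median, q3 = quantiles(scores, n=4)  # internal quartiles (Step 3)

def percentile_rank(score, scores):
    """Share of historical tests this score meets or beats."""
    return sum(s <= score for s in scores) / len(scores)

new_score = 0.43
print(f"Internal quartiles: Q1={q1:.0%}, median={median:.0%}, Q3={q3:.0%}")
print(f"New concept at {new_score:.0%} beats {percentile_rank(new_score, scores):.0%} of history")
```

The metadata fields are what make Step 5 possible: once the history is large enough, the same ranking can be run within a single concept type or stimulus slice.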

The Timeline

Internal benchmarks are not instant. Here is a realistic ramp:

Tests Completed | Benchmark Reliability
1-5 | No benchmark; rely on qualitative signals
6-10 | Rough directional range; you know what “high” and “low” look like
11-20 | Meaningful quartiles; can rank new concepts against history
20+ | Robust internal norms with outcome correlation

At $20 per interview, the economics of building this testing history are accessible to teams of any size. A continuous testing program that evaluates 2-3 concepts per month builds a robust internal norm within a year.

When Benchmarks Help


Benchmarks are useful in specific situations:

  • Comparing concepts against each other. Relative ranking within the same test is always more reliable than comparing against an external norm.
  • Tracking over time. If your concepts are consistently improving (or declining) against your internal baseline, that trend is meaningful.
  • Setting minimum thresholds. After enough tests with outcome data, you can identify the floor below which concepts rarely succeed in-market. This is a useful kill criterion; a rough sketch of the idea follows this list.
  • Communicating to stakeholders. Executives want a number. A percentile ranking against your internal database is a defensible way to provide one.
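
As a rough illustration of that kill-criterion idea, here is a minimal sketch using entirely hypothetical launch outcomes; the bands and numbers are made up to show the mechanics, not to suggest a real threshold.

```python
# Hypothetical (T2B score, succeeded in-market) pairs for past launches.
outcomes = [
    (0.28, False), (0.33, False), (0.36, False), (0.39, True),
    (0.41, False), (0.44, True), (0.47, True), (0.51, True),
    (0.53, True), (0.58, True),
]

# Success rate within 10-point score bands shows where the floor sits.
bands = {}
for score, succeeded in outcomes:
    band = round(score * 100) // 10 * 10       # e.g. 0.44 -> 40
    wins, total = bands.get(band, (0, 0))
    bands[band] = (wins + int(succeeded), total + 1)

for band in sorted(bands):
    wins, total = bands[band]
    print(f"{band}-{band + 9}% T2B: {wins}/{total} launches succeeded")
```

With a real history the floor is wherever the success rate drops toward zero; with only a handful of launches it is a prompt for judgment, not a rule.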

When Benchmarks Mislead


Benchmarks actively harm decision-making when:

  • They replace qualitative understanding. A concept that scores 35% T2B but generates passionate enthusiasm from a specific segment is often a better bet than one scoring 45% with lukewarm, undifferentiated appeal. The number does not capture this.
  • They become a bureaucratic gate. “Must score above the 60th percentile” as a rigid policy kills innovative concepts that challenge category conventions and initially confuse respondents.
  • They ignore the “why.” Two concepts can score identically for entirely different reasons. One resonates on the core benefit but confuses on pricing. The other is clear on pricing but has a weak benefit. The right next step is completely different, and the benchmark tells you nothing.
  • They create false confidence. A strong benchmark score does not mean the concept will succeed. Execution, pricing, distribution, timing, and competitive response all matter more than a pre-launch score.

Why Qualitative Depth Matters More Than the Number


The most valuable output of a concept test is not the score. It is the understanding of why people respond the way they do.

A 30-minute depth interview with 5-7 levels of probing reveals the reasoning, emotional response, comparisons, objections, and conditions under which a respondent would or would not engage with a concept. This qualitative depth tells you what to fix, what to amplify, and what to abandon in a way that no aggregate score can.

This is where AI-moderated concept testing changes the equation. Traditional quant-first concept testing gives you the score and leaves you guessing about the why. Depth interviews at scale give you both, with the qualitative insight to act on what the numbers mean.

A Better Framework


Instead of asking “Is this score good?”, ask:

  1. Is this concept better than the alternatives we tested? Relative comparison within a well-designed test is always more reliable than external benchmarks.
  2. Do respondents understand and articulate the core value proposition? If they can play it back in their own words, the concept communicates clearly.
  3. What are the specific objections, and are they fixable? The qualitative data reveals whether low scores reflect fundamental concept weakness or execution issues that can be addressed.
  4. Does a specific segment respond strongly? A concept that polarizes (loved by some, rejected by others) may have a viable niche even if the aggregate score is mediocre.
  5. How does this compare to our own testing history? Internal benchmarks, built over time, are the most reliable comparison available.

For guidance on designing concept tests that produce both quantitative and qualitative insight, see the complete guide to concept testing. For crafting questions that surface the “why” behind the scores, see concept testing questions.

Frequently Asked Questions

Why are top-2-box scores unreliable on their own?

Top-2-box scores are inflated by social desirability—participants who find a concept inoffensive often respond positively rather than neutrally, because it feels impolite to say they would not buy something. Scores also vary substantially by category, stimulus quality, and how the question is framed. A 60% top-2-box score in a low-involvement category may represent weak performance, while the same score in a complex B2B category may represent extraordinary appeal.

Why can’t I just compare my score to a normative database?

Normative databases aggregate scores across heterogeneous products, stimulus types, categories, and time periods in ways that make comparison unreliable. A concept in your category, tested with your specific audience and stimulus format, has almost no meaningful benchmark in a database built from studies across dozens of categories and research approaches. The variance within normative databases is often larger than the difference between a “good” and an “average” concept score.

How do you build internal benchmarks?

Internal benchmarks are built by maintaining consistent testing methodology—same stimulus format, same question wording, same sample criteria—across multiple concepts over time. This creates a reference set of scores that are directly comparable because they were produced under identical conditions. The 60th percentile of your internal benchmark is a meaningful signal; the 60th percentile of a normative database built from inconsistent studies is not.

How does User Intuition support internal benchmarking?

User Intuition's Intelligence Hub stores all historical study data with consistent metadata, enabling teams to track concept scores over time against their own testing history. Because the AI-moderated interview format is standardized across studies, scores are directly comparable—allowing teams to build internal benchmarks that reflect their specific category, audience, and stimulus approach rather than borrowing norms from an unrelated database.