Search for “concept testing template” and you will find survey checklists. Likert scales wrapped in a PDF. “Rate this concept from 1 to 5 on the following dimensions.” These templates produce a number. They do not produce understanding.
This template is different. It is a framework for running concept tests through depth interviews — structured conversations where consumers spend 10-20 minutes reacting to your concept in open-ended dialogue, with 5-7 levels of laddering on every meaningful response. The output is not a top-2-box score. The output is a clear picture of why consumers react the way they do, what specifically drives appeal or rejection, and whether the concept should move forward, be refined, or be killed.
The framework has five parts: a study design checklist, a discussion guide template, stimulus presentation formats for different testing designs, an evaluation rubric that captures both signal and reasoning, and a go/refine/kill decision matrix that translates findings into action. Each part is designed to work independently or as a complete system. Use the pieces you need.
This template is built from thousands of AI-moderated consumer interviews run on the User Intuition platform. It reflects what actually works when the goal is actionable intelligence rather than research theater.
Why Do Most Concept Testing Templates Fail?
The concept testing template market has a structural problem. The templates that rank highest in search results are designed for the simplest possible research format: online surveys. They give you a list of questions, a rating scale, and instructions for calculating a score. This is efficient. It is also nearly useless for making decisions about concepts that matter.
Survey templates fail for three specific reasons.
They produce scores without explanations. A concept that scores 4.2 out of 5 on “appeal” tells you that consumers generally found it appealing. It does not tell you what about the concept drove that reaction, whether the appeal is rooted in novelty or genuine need, or whether the appeal would survive contact with a store shelf, a price tag, or a competitor. When the concept underperforms in market — and concepts with strong survey scores underperform in market constantly — the survey data offers no diagnostic value. You know the score dropped. You do not know why.
They assume a moderator who may or may not probe. Focus group templates are better than survey templates because they at least acknowledge that concept reactions need to be explored. But they depend entirely on the moderator’s skill and consistency. A moderator who probes five levels deep on “I like the packaging” will uncover the emotional driver behind that reaction. A moderator who accepts “I like the packaging” at face value will not. Most concept testing templates provide questions but no probing structure, which means the depth of insight varies wildly across sessions and moderators. When you run concept testing vs focus groups, this moderator variability is one of the biggest threats to data quality.
They ignore the decision-making gap. This is the failure that costs the most money. Almost no concept testing template addresses what happens after the data comes in. Teams collect concept test results, present them in a deck, and then argue about what the results mean. The argument is never about the data — it is about interpretation. “Consumers liked it” versus “consumers were polite about it.” “The purchase intent was strong” versus “purchase intent is always inflated in research.” Without a predetermined decision framework, concept test results become ammunition for whoever has the strongest opinion in the room.
The template below addresses all three failures. The discussion guide produces explanations, not just scores. The probing structure is built in, so depth does not depend on moderator talent. And the decision matrix forces teams to define success criteria before the study runs — so results are evaluated against standards, not debated subjectively.
What Is the Five-Part Concept Testing Framework?
This framework is modular. You can use all five parts as an end-to-end system, or pull individual parts into your existing research process.
Part 1: Study Design Checklist — The strategic foundation. Defines what you are testing, who should evaluate it, what success looks like, and how results will be used. Completing this checklist before anything else prevents the two most common concept testing mistakes: testing the wrong concept with the wrong audience, and running a study without knowing what the results need to accomplish.
Part 2: Discussion Guide Template — The interview structure. Nine sections of open-ended questions with built-in laddering prompts that move from surface reaction to underlying motivation. Designed for AI-moderated depth interviews but adaptable to any conversational format.
Part 3: Stimulus Presentation Formats — The testing design. Four formats (monadic, sequential monadic, comparative, and protomonadic) with guidance on when to use each, how to structure the interview flow, and how to control for order effects and comparison bias.
Part 4: Evaluation Rubric — The analysis framework. Five core dimensions with both quantitative signals and qualitative evidence capture. Designed for segment-level analysis so you can see not just what consumers think, but which consumers think it and why.
Part 5: Go/Refine/Kill Decision Matrix — The action framework. Translates evaluation results into clear recommendations using threshold criteria your team defines before the study. Eliminates the subjective debate that wastes most concept test findings.
Part 1: Study Design Checklist
Every concept test that fails to produce a clear decision can trace the failure back to study design. Either the objective was vague, the audience was wrong, the stimuli were incomplete, or there was no definition of what “good” looks like. This checklist prevents those failures.
1. Define the Decision
Before designing anything else, write one sentence that describes the decision this concept test will inform. Not the research question — the business decision.
- “We will decide whether to move Concept A into packaging design or go back to ideation.”
- “We will choose between three naming directions for the Q3 launch.”
- “We will determine whether the updated positioning resonates with lapsed buyers.”
If you cannot write this sentence, you are not ready to run a concept test. The study design, discussion guide, and decision matrix all flow from this statement. Without it, you are collecting data without a purpose.
2. Define the Audience
Who should evaluate this concept? The answer is not “general consumers” — it is the specific segment whose reaction matters for the decision you defined above.
| Audience Parameter | What to Specify |
|---|---|
| Category users | Current buyers in the category, lapsed buyers, or non-buyers? |
| Brand relationship | Current customers, competitor customers, or brand-unaware? |
| Demographics | Age, income, geography — only if relevant to the decision |
| Behavioral filters | Purchase frequency, channel preference, usage occasion |
| Segment splits | Do you need to compare reactions across segments? Define them now |
For concept test sample size guidance, the general rule is 50-100 interviews per concept for monadic designs, 100-200 total for comparative designs with segment splits. AI-moderated interviews make these numbers practical within 48-72 hours at a fraction of traditional cost.
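To make the arithmetic concrete, here is a minimal planning sketch in Python. The floor values come from the rule of thumb above and the per-segment minimum in Part 3; the function and its parameters are illustrative, not part of any platform:

```python
# Hypothetical sample-size planner for a concept test.
# Floors below are the rule-of-thumb numbers from this checklist,
# not platform-mandated values -- adjust to your own criteria.

MONADIC_MIN_PER_CONCEPT = 50   # 50-100 interviews per concept
SEGMENT_MIN_PER_CONCEPT = 30   # minimum per segment, per concept

def plan_sample(n_concepts: int, segments: list[str], design: str) -> int:
    """Return the minimum total interviews for a study."""
    if design == "monadic":
        # Each segment cell needs its own floor; the per-concept floor
        # still applies when segments are few or absent.
        per_concept = max(MONADIC_MIN_PER_CONCEPT,
                          SEGMENT_MIN_PER_CONCEPT * max(len(segments), 1))
        return per_concept * n_concepts
    if design == "comparative":
        # All consumers see all concepts, so the sample does not multiply
        # by concept count; 100-200 total is the working range here.
        return max(100, SEGMENT_MIN_PER_CONCEPT * max(len(segments), 1))
    raise ValueError(f"unknown design: {design}")

# Example: 3 concepts, monadic, 2 segments -> 180 interviews minimum.
print(plan_sample(3, ["heavy buyers", "light buyers"], "monadic"))
```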
3. Define the Stimuli
What exactly will consumers see, hear, or read? Stimulus quality directly determines data quality. Be explicit about:
- Format: Written concept board, visual mockup, video prototype, name-only, packaging render
- Fidelity level: Rough sketch, polished mockup, production-ready asset
- Context framing: Will you explain the category context, or let the concept speak for itself?
- What is included: Product description, benefit claims, pricing, imagery, brand name
- What is deliberately excluded: Elements you want to test separately in later phases
A common mistake is testing a concept that includes too many variables. When a concept board combines a new product idea, a new brand name, a specific price point, and three benefit claims, you cannot isolate which element drove the reaction. Test one variable at a time, or design your analysis to account for multi-variable stimuli.
4. Define Success Criteria
What thresholds separate “go” from “refine” from “kill”? Define these before the study runs, not after you see the results. Examples:
- Go: 70%+ of target segment finds concept appealing with reasons tied to a genuine unmet need; purchase intent at 7+ out of 10 with behavioral justification
- Refine: 50-70% appeal with clear, consistent feedback on what to improve; barriers are addressable
- Kill: Below 50% appeal, or appeal is driven by novelty rather than need; barriers are structural
Your thresholds will differ by industry, concept type, and organizational risk tolerance. The point is to have them written down before data collection starts.
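Teams sometimes find it useful to write these thresholds down in a machine-readable form before fieldwork, so they cannot drift once results arrive. A hypothetical encoding of the example criteria above; the keys and structure are illustrative, not a platform schema:

```python
# Go/refine/kill thresholds documented before the study runs.
# All names and cutoffs are illustrative examples from this checklist.

SUCCESS_CRITERIA = {
    "go": {
        "min_appeal_pct": 70,        # 70%+ of target segment finds it appealing
        "min_purchase_intent": 7,    # 7+ out of 10 with behavioral justification
        "appeal_must_tie_to_unmet_need": True,
    },
    "refine": {
        "min_appeal_pct": 50,        # the 50-70% appeal band
        "barriers_must_be_addressable": True,
    },
    "kill_triggers": [
        "appeal below 50% of target segment",
        "appeal driven by novelty rather than need",
        "barriers are structural",
    ],
}
```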
5. Define Timeline and Sample
| Parameter | Specification |
|---|---|
| Total interviews | Number per concept, per segment |
| Timeline | Study launch to final deliverable |
| Analysis needs | Segment comparisons, cross-tabulations, verbatim themes |
| Deliverable format | Report, dashboard, decision matrix, raw transcripts |
With AI-moderated interviews, the typical timeline from study design to actionable findings is 48-72 hours. Traditional approaches take 4-6 weeks for the same scope.
Part 2: Discussion Guide Template
This discussion guide is designed for depth interviews — one consumer, one conversation, 15-30 minutes per concept. Each section includes primary questions and laddering prompts that push past surface reactions to underlying motivations. The guide follows a deliberate arc: from unaided first reaction through comprehension, appeal, relevance, differentiation, purchase intent, and improvement.
If you are working with a bank of concept testing questions, this guide provides the structure for deploying them effectively.
Section 1: Opening and Context Setting
The opening establishes the conversational frame without leading the participant. Do not describe the concept. Do not explain what you are testing. Do not prime the consumer with category context unless your study design requires it.
Primary questions:
- “I am going to show you something and I would like to get your honest reaction. There are no right or wrong answers — I am interested in what you genuinely think and feel.”
- “Before we look at anything, tell me a bit about how you currently [relevant category behavior]. What does that look like for you day to day?”
The second question establishes a behavioral baseline. When the consumer later reacts to your concept, you can anchor their reaction to their actual experience rather than an abstract opinion.
Section 2: First Reaction (Unaided)
This is the most important moment in the interview. The first five seconds of exposure produce the most honest signal. Capture it before the consumer has time to construct a considered response.
Primary questions:
- “Take a look at this. What is the very first thought that comes to mind?”
- “What is your gut reaction?”
Laddering prompts:
- “Tell me more about that.”
- “What about it made you think/feel that?”
- “You said [their word]. What does [their word] mean to you in this context?”
Do not rush past this section. The first reaction often contains the diagnostic signal that the entire concept test hinges on. A consumer who says “oh, this reminds me of [competitor]” in the first three seconds is telling you something about category positioning that no amount of structured questioning will surface as cleanly.
Section 3: Concept Comprehension
Before you can evaluate appeal, you need to confirm the consumer actually understands what the concept is and does. Comprehension failures are the most common — and most fixable — finding in concept tests.
Primary questions:
- “In your own words, what is this? What does it do?”
- “Who do you think this is for?”
- “Is there anything about this that is confusing or unclear?”
Laddering prompts:
- “What makes you say it is for [their answer]?”
- “When you say confusing, what specifically is unclear?”
- “What would you need to see or know to make it clearer?”
If more than 30% of your sample misunderstands the concept, the concept has a communication problem regardless of how it scores on appeal. Fix comprehension first.
Section 4: Appeal Exploration with Laddering
This is where depth interviews diverge from surveys entirely. A survey asks “how appealing is this concept on a scale of 1-5.” An interview asks why — and keeps asking until the actual driver is visible.
Primary questions:
- “What, if anything, do you find appealing about this?”
- “What, if anything, do you find unappealing?”
- “What stands out to you the most — positively or negatively?”
Laddering sequence (5-7 levels):
- “You said [specific element] is appealing. What about it appeals to you?”
- “What does that [quality/feature] mean to you personally?”
- “Why is that important to you?”
- “How does that connect to what you are looking for in [category]?”
- “When you think about [category] products that have worked well for you, how does this compare on that dimension?”
- “If this [quality] were missing, how would that change your reaction?”
- “Is this something you actively look for, or something you did not realize mattered until you saw it here?”
The laddering sequence moves from attribute (“I like the packaging”) to consequence (“it looks premium”) to value (“I want products that reflect my taste”) to behavior (“I would pick this up off the shelf because it stands out”). Each level adds specificity that the previous level lacked.
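For scripted or AI-moderated interviews, the same sequence can be expressed as an ordered prompt chain. The sketch below is purely illustrative; it shows the structure of the ladder, not how User Intuition or any other platform implements it:

```python
# A minimal sketch of the laddering sequence as an ordered prompt chain.
# Templates mirror the 5-7 level sequence above; names are hypothetical.

LADDER = [
    "You said {element} is appealing. What about it appeals to you?",
    "What does that {quality} mean to you personally?",
    "Why is that important to you?",
    "How does that connect to what you are looking for in {category}?",
    "When you think about {category} products that have worked well for "
    "you, how does this compare on that dimension?",
    "If this {quality} were missing, how would that change your reaction?",
    "Is this something you actively look for, or something you did not "
    "realize mattered until you saw it here?",
]

def next_probe(level: int, element: str, quality: str, category: str) -> str:
    """Return the probe for the given ladder depth (0-indexed)."""
    return LADDER[min(level, len(LADDER) - 1)].format(
        element=element, quality=quality, category=category)

print(next_probe(0, element="the packaging", quality="premium look",
                 category="skincare"))
```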
Section 5: Clarity Assessment
Separate from comprehension. Comprehension asks whether the consumer understands what the concept is. Clarity asks whether the concept communicates its value proposition effectively.
Primary questions:
- “What is this concept promising you? What benefit would you get?”
- “Is the benefit clear, or did you have to work to figure it out?”
- “Is there anything about this concept that feels like it is trying too hard, or not saying enough?”
Laddering prompts:
- “What specifically made it hard to figure out?”
- “If you had to explain this to a friend, how would you describe it?”
- “What is missing that would make the promise more believable?”
Section 6: Relevance and Need Fit
Appeal without relevance is entertainment. This section determines whether the concept solves a problem the consumer actually has.
Primary questions:
- “Is this something you would actually use in your life? Why or why not?”
- “Does this solve a problem you currently have, or address a need you recognize?”
- “How are you handling [this need] today? What are you currently using or doing?”
Laddering prompts:
- “When you say you would not use it, is that because the problem does not exist for you, or because this is not the right solution?”
- “What would have to be true for this to be relevant to you?”
- “How big of a problem is [need] for you? Is it something you think about, or something that barely registers?”
Section 7: Uniqueness vs. Alternatives
This section reveals competitive positioning from the consumer’s perspective — which is often different from the team’s internal framing.
Primary questions:
- “Does this feel different from what is already available, or does it feel similar to something you have seen before?”
- “What does this remind you of? What comes to mind?”
- “If this did not exist, what would you use instead?”
Laddering prompts:
- “What specifically makes it feel different (or similar)?”
- “Is that difference meaningful to you, or is it a distinction without a difference?”
- “Would the difference be enough to make you switch from what you currently use?”
Section 8: Purchase Intent and Barriers
Purchase intent questions in isolation are unreliable. Anchored in behavioral context after exploring appeal, relevance, and alternatives, they become significantly more diagnostic.
Primary questions:
- “Imagine this was available where you normally shop. Walk me through what would happen when you saw it.”
- “What would you need to know, see, or believe before you would buy this?”
- “What might stop you from buying this, even if you found it appealing?”
Laddering prompts:
- “You mentioned [barrier]. How significant is that? Is it a dealbreaker or a hesitation?”
- “If that barrier were removed, would you buy it? What would that look like?”
- “At what price would this feel like a no-brainer? At what price would you walk away?”
Section 9: Improvement Suggestions
This section works best at the end, after the consumer has fully articulated their reaction. Earlier in the interview, improvement suggestions tend to be superficial. After 15-20 minutes of depth exploration, consumers offer more specific and actionable feedback.
Primary questions:
- “If you could change one thing about this concept to make it more appealing to you, what would it be?”
- “Is there anything missing that you would expect to see?”
- “If you were designing this for someone like you, what would you do differently?”
Adaptation Notes by Concept Type
The core discussion guide structure above applies to all concept types. Adjust the emphasis based on what you are testing:
- Product concepts: Emphasize Sections 4 (appeal) and 6 (relevance/need fit). The central question is whether the product solves a real problem.
- Packaging design: Emphasize Sections 2 (first reaction) and 7 (uniqueness). Packaging is processed in seconds — the unaided reaction and competitive differentiation are the primary signals.
- Messaging and positioning: Emphasize Sections 3 (comprehension) and 5 (clarity). The central question is whether the message lands as intended.
- Ad creative: Emphasize Sections 2 (first reaction) and 4 (appeal). Add questions about emotional response: “How did this make you feel?” with laddering on the specific emotion.
- Naming: Emphasize Section 3 (comprehension) with naming-specific questions: “What comes to mind when you hear this name?” and “What kind of product/brand do you imagine?”
- Pricing: Emphasize Section 8 (purchase intent) with value framing: “At this price, what would you expect?” and “Does the price match what you think this is worth?”
Part 3: Stimulus Presentation Formats
How you present concepts to consumers is as important as what you ask about them. The presentation format introduces structural biases that affect every data point in the study. Choosing the right format is a methodological decision, not a logistical one.
For a detailed comparison of the two most common formats, see our guide on monadic vs sequential concept testing.
Monadic Design
What it is: Each consumer evaluates one concept and only one concept. No comparisons, no alternatives, no anchoring.
When to use it:
- You need unanchored reactions — what consumers think of this concept on its own merits
- You are testing a single concept for go/no-go
- You want to avoid comparison bias (where a weak concept makes an adjacent concept look stronger than it is)
- You have enough sample to assign separate groups to each concept
Interview flow:
- Context setting and category baseline (Section 1)
- Stimulus presentation — single concept
- Full discussion guide (Sections 2-9)
- Wrap-up
Sample allocation: Minimum 50 interviews per concept. For segment-level analysis, minimum 30 per segment per concept.
Advantage: Purest signal. Consumer reactions are uncontaminated by comparison effects.
Limitation: No direct preference data. You compare concepts by comparing their independent scores across separate samples.
Sequential Monadic Design
What it is: Each consumer evaluates multiple concepts, one at a time, in a randomized order. Each concept gets the full evaluation before the next is introduced.
When to use it:
- You need direct within-subject comparison data
- You are evaluating 2-4 concept variations and need to understand relative preference
- Your sample size is limited and you cannot afford separate monadic cells
Interview flow:
- Context setting and category baseline (Section 1)
- First concept presentation (randomized assignment)
- Full discussion guide (Sections 2-9) for first concept
- Transition: “Now I am going to show you something different.”
- Second concept presentation
- Full discussion guide for second concept
- (Repeat for additional concepts)
- Comparative questions: “Now that you have seen all of them, which resonated most with you and why?”
Sample allocation: Minimum 100 total interviews. Randomize concept order to control for primacy and recency effects.
Advantage: Direct preference data within the same consumer. Efficient use of sample.
Limitation: Order effects are real. The first concept evaluated carries a slight advantage (primacy) or disadvantage (anchoring), depending on the category. Randomization mitigates but does not eliminate this.
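For teams implementing randomization themselves, here is a minimal sketch of one counterbalancing approach: cycle through every ordering so each concept appears in each position equally often. Function and variable names are illustrative:

```python
import random
from itertools import permutations

# Balanced order assignment for a sequential monadic study. Cycling
# through all orderings (practical for 2-4 concepts) spreads primacy
# and recency effects evenly. Illustrative, not a platform API.

def assign_orders(concepts: list[str], n_participants: int,
                  seed: int = 7) -> list[tuple[str, ...]]:
    """Cycle through every ordering so each position is filled equally often."""
    orders = list(permutations(concepts))   # 3 concepts -> 6 orderings
    random.Random(seed).shuffle(orders)     # randomize the rotation start
    return [orders[i % len(orders)] for i in range(n_participants)]

schedule = assign_orders(["A", "B", "C"], n_participants=120)
# With 120 participants and 6 orderings, each concept appears first
# exactly 40 times -- primacy is balanced, not merely randomized.
first_counts = {c: sum(1 for o in schedule if o[0] == c) for c in "ABC"}
print(first_counts)  # {'A': 40, 'B': 40, 'C': 40}
```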
Comparative Design
What it is: Multiple concepts presented simultaneously for side-by-side evaluation. Consumers see all concepts at once and react to them as a set.
When to use it:
- You are choosing between final candidates and need a direct preference ranking
- The concepts are similar enough that side-by-side comparison is natural (e.g., three packaging designs, two taglines)
- You have already run independent evaluations and need a tiebreaker
Interview flow:
- Context setting and category baseline (Section 1)
- All concepts presented simultaneously
- “Look at all of these. Which one draws your attention first? Why?”
- Comparative evaluation: appeal, clarity, and uniqueness across all concepts
- Ranking with justification: “If you had to choose one, which would it be? Walk me through why.”
- Element-level comparison: “Is there anything from Concept B that you wish Concept A had?”
Sample allocation: Minimum 75 interviews. All consumers see all concepts.
Advantage: Forced choice produces clear preference data. Element-level comparison reveals optimization opportunities.
Limitation: Comparison anchoring distorts absolute reactions. A mediocre concept looks good next to a weak one. Never use comparative as your only testing format.
Protomonadic Design
What it is: A two-phase design. Phase 1 is monadic — each consumer evaluates one concept independently. Phase 2 is comparative — after the monadic evaluation, all concepts are revealed for direct comparison. This is the gold standard for concept testing that matters.
When to use it:
- High-stakes decisions where you need both unanchored reactions and direct preference data
- You have budget for a larger sample
- You want to measure whether comparison changes initial preferences (a signal of concept vulnerability)
Interview flow:
- Context setting and category baseline (Section 1)
- Phase 1: Monadic evaluation of assigned concept (Sections 2-9)
- Transition: “Now I am going to show you some alternatives.”
- Phase 2: Reveal all concepts. “Does seeing these change your reaction to the first one? How?”
- Comparative ranking with justification
- Preference stability check: “Would you still choose the same one you initially preferred?”
Sample allocation: Minimum 75 interviews per concept (each consumer is assigned one concept for the monadic phase, then sees all concepts in the comparative phase).
Advantage: Measures both independent merit and competitive robustness. A concept that scores well monadically but loses in comparison has a positioning problem. A concept that scores moderately monadically but wins in comparison has latent strength that context activates.
Limitation: Longer interviews. Higher cost per respondent. Worth it for concepts that will receive significant investment.
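One derived metric worth computing from a protomonadic study is preference stability: how often a consumer's monadic-phase concept survives the comparative reveal. A hypothetical sketch, with illustrative field names:

```python
# Hypothetical preference-stability check for a protomonadic study.
# Record structure and field names are illustrative.

def stability_rate(records: list[dict], concept: str) -> float:
    """Share of consumers assigned `concept` in the monadic phase who
    still chose it after seeing all alternatives (Phase 2)."""
    assigned = [r for r in records if r["monadic_concept"] == concept]
    if not assigned:
        return 0.0
    kept = sum(1 for r in assigned if r["comparative_choice"] == concept)
    return kept / len(assigned)

records = [
    {"monadic_concept": "A", "comparative_choice": "A"},
    {"monadic_concept": "A", "comparative_choice": "B"},
    {"monadic_concept": "A", "comparative_choice": "A"},
]
# 0.67 here; a low rate on a concept that scored well monadically is
# the positioning-problem signal described above.
print(round(stability_rate(records, "A"), 2))
```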
Best Practices for Stimulus Materials
Regardless of presentation format, stimulus quality determines data quality.
- Match fidelity to the decision. Early-stage screening can use text descriptions. Finalist evaluation should use polished visual mockups at minimum.
- Control variables. If you are testing the concept, hold the design constant. If you are testing the design, hold the concept constant.
- Include context where it matters. A packaging concept tested in isolation performs differently than a packaging concept shown in a simulated shelf environment. Match the testing context to the real-world context.
- Avoid production-quality bias. Consumers rate polished stimuli higher regardless of concept strength. If you are testing concept merit, use a consistent medium-fidelity format.
Part 4: Evaluation Rubric
The evaluation rubric translates raw interview data into structured assessment across consistent dimensions. This is where most ad hoc concept testing falls apart — without a rubric, analysis drifts toward the most memorable quotes rather than the most representative patterns.
Core Evaluation Dimensions
Every concept test, regardless of industry or concept type, should evaluate these five dimensions:
| Dimension | What It Measures | Key Interview Signals |
|---|---|---|
| Appeal | Does the concept attract interest and positive emotional response? | First reaction valence, enthusiasm language, spontaneous interest in learning more |
| Clarity | Does the concept communicate its value proposition clearly? | Accurate unaided playback, ability to explain to others, absence of confusion |
| Relevance | Does the concept address a real need the consumer recognizes? | Connection to current behavior/pain points, expressed willingness to change current solution |
| Uniqueness | Does the concept feel different from existing alternatives? | Comparison language, perceived differentiation, “I have not seen this before” signals |
| Purchase Intent | Would the consumer actually buy/use this? | Behavioral scenario specificity, barrier severity, price sensitivity signals |
Capturing Quantitative Signals and Qualitative Reasoning
Each dimension should be evaluated on two levels:
Quantitative signal: A directional score (strong/moderate/weak, or a 1-10 scale) that allows comparison across concepts and segments. In AI-moderated interviews, these signals are derived from the conversation content and behavioral indicators, not from a rating scale question. For instance, purchase intent is scored based on the specificity and realism of the purchase scenario the consumer describes, not based on them saying “8 out of 10.”
Qualitative evidence: The reasoning behind the score. Direct quotes, laddering chains, and thematic patterns that explain why the score is what it is. This is the layer that surveys cannot produce and that makes concept test findings actionable.
A sample rubric entry looks like this:
| Dimension | Signal | Evidence | Segment |
|---|---|---|---|
| Appeal | Strong | “This is exactly what I have been looking for” — laddering reveals frustration with current category options, concept addresses specific unmet need around [X]. 14 of 20 participants in this segment used language indicating active interest. | Women 25-34, category heavy buyers |
| Appeal | Weak | “It is fine, nothing wrong with it” — laddering reveals no emotional engagement, concept is processed as category-generic. 8 of 15 participants could not identify a specific differentiator. | Men 35-44, category light buyers |
Segment-Level Analysis
Aggregate scores hide the insights that matter most. A concept with 60% overall appeal might have 85% appeal among your core target and 30% appeal among everyone else — which is a strong result, not a mediocre one.
Define your segments in the study design checklist (Part 1) and evaluate every dimension at the segment level. Common segmentation axes for concept testing:
- Demographic: Age, gender, income, geography
- Behavioral: Category purchase frequency, brand loyalty, channel preference
- Need state: What problem they are solving, how urgently they need a solution
- Attitudinal: Price sensitivity, openness to new products, brand trust levels
AI-moderated interviews at scale make segment-level analysis practical. When you are running 100-200 interviews across defined segments, you have enough depth within each cell to identify real patterns rather than anecdotal reactions. This is where the cost advantage of AI-moderated interviews compounds — running segment-level concept tests at traditional per-interview costs would require budgets that most teams cannot justify.
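To make the segment math concrete, here is a minimal sketch of how segment-level appeal rates might be tallied once interviews are coded. The record format is hypothetical:

```python
from collections import defaultdict

# Segment-level signal aggregation, using the appeal example above
# (60% overall can hide 85% vs. 30%). Field names are hypothetical.

def appeal_by_segment(interviews: list[dict]) -> dict[str, float]:
    """Percent of interviews per segment scored 'strong' on appeal."""
    totals, strong = defaultdict(int), defaultdict(int)
    for iv in interviews:
        totals[iv["segment"]] += 1
        strong[iv["segment"]] += iv["appeal_signal"] == "strong"
    return {seg: 100 * strong[seg] / totals[seg] for seg in totals}

interviews = (
    [{"segment": "core target", "appeal_signal": "strong"}] * 17
    + [{"segment": "core target", "appeal_signal": "weak"}] * 3
    + [{"segment": "everyone else", "appeal_signal": "strong"}] * 6
    + [{"segment": "everyone else", "appeal_signal": "weak"}] * 14
)
print(appeal_by_segment(interviews))
# {'core target': 85.0, 'everyone else': 30.0} -- a strong result
# that a single aggregate number would have flattened to ~57%.
```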
Part 5: Go/Refine/Kill Decision Matrix
The decision matrix is the most important part of this template and the part that most concept testing frameworks leave out entirely. Without it, concept test results become a Rorschach test — every stakeholder sees what they want to see.
Setting Threshold Criteria Before the Study
This is non-negotiable. If you define success criteria after seeing the data, you are not evaluating — you are rationalizing. Before the study launches, your team should agree on:
- Which dimensions are must-haves vs. nice-to-haves. A concept might have weak uniqueness but strong relevance — is that a go or a refine? The answer depends on your strategy, and it must be decided before data exists to argue over.
- What “strong,” “moderate,” and “weak” mean in each dimension. Anchor these to specific evidence types, not abstract scores. “Strong appeal” might mean “70%+ of target segment spontaneously expresses interest and can articulate a specific reason tied to an unmet need.” “Weak appeal” might mean “fewer than 40% express interest, or interest is driven by novelty rather than need.”
- What combination of signals maps to each outcome. How many dimensions need to be “strong” for a go? How many “weak” signals trigger a kill? One way to encode these rules is sketched below.
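A minimal sketch of one such ruleset, assuming appeal, relevance, and purchase intent are the must-have dimensions (an assumption for illustration; your must-haves may differ):

```python
# One hypothetical encoding of the signal-to-outcome mapping agreed
# before the study. Dimension names and rules are illustrative.

MUST_HAVES = {"appeal", "relevance", "purchase_intent"}

def outcome(signals: dict[str, str]) -> str:
    """signals maps dimension name -> 'strong' | 'moderate' | 'weak',
    scored for the core target segment."""
    must_have = {d: s for d, s in signals.items() if d in MUST_HAVES}
    strong = sum(1 for s in must_have.values() if s == "strong")
    weak = sum(1 for s in must_have.values() if s == "weak")
    if weak >= 2:
        return "kill"    # 2+ must-have dimensions below threshold
    if strong >= 2 and weak == 0:
        return "go"      # 2+ must-haves strong, none failing
    return "refine"      # mixed signals: fix specific gaps, re-test

print(outcome({"appeal": "strong", "relevance": "strong",
               "purchase_intent": "moderate", "uniqueness": "weak"}))
# -> "go" (uniqueness is a nice-to-have in this example ruleset)
```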
The 3x3 Decision Matrix
| | Strong (2+ key dimensions) | Moderate (mixed signals) | Weak (2+ key dimensions below threshold) |
|---|---|---|---|
| Core target segment | GO — Move to next phase | REFINE — Address specific gaps | KILL — Fundamental mismatch |
| Adjacent segment | EXPAND — Concept has broader appeal than expected | MONITOR — Not a priority but worth tracking | EXPECTED — Concept was not designed for this segment |
| General population | MASS POTENTIAL — Rare; validate with larger sample | NICHE STRENGTH — Focus on core target | NICHE — Not a problem if core target is strong |
The matrix produces one of three outcomes:
Go: The concept meets or exceeds threshold criteria in the core target segment across the dimensions your team defined as must-haves. Move it to the next development phase. The qualitative evidence layer tells you what to protect — the specific elements that drove strong reactions — as the concept evolves through design, production, and marketing.
Refine: The concept shows promise but has specific, addressable gaps. The qualitative evidence tells you exactly what to fix. “Appeal is strong but clarity is weak” means the concept idea is right but the communication is wrong. “Relevance is strong but uniqueness is weak” means the need is real but the concept does not differentiate from existing solutions. Refine is only valid when the evidence points to specific, actionable changes — not when it means “test again and hope for better numbers.”
Kill: The concept fails to meet threshold criteria on must-have dimensions in the core target segment. The qualitative evidence tells you why — which is valuable for future concept development even though this specific concept is done. Killing a concept is not a failure of the innovation process. Killing it late — after packaging, production, and launch investment — is.
How to Handle “Refine” Outcomes
“Refine” is the most dangerous outcome because it is the most ambiguous. Teams that love a concept will interpret “refine” as “basically a go with minor tweaks.” Teams that are skeptical will interpret it as “basically a kill that nobody wants to call.”
Use the qualitative evidence to make “refine” specific:
- What exactly needs to change? Not “improve the messaging” but “the benefit claim about [X] does not land because consumers do not recognize [X] as a problem — reframe around [Y], which 80% of the target identified as a genuine pain point.”
- Is the change feasible? A concept that needs a fundamentally different product architecture is not a refine — it is a new concept. A concept that needs a different headline is a genuine refine.
- What does success look like after refinement? Define the thresholds for the re-test. If the refined concept does not meet them, it is a kill.
The Override Problem
The most common failure in concept testing is not bad research — it is teams overriding good research because they are emotionally invested in a concept. This happens in a predictable pattern: the concept tests poorly, someone on the team says “consumers just did not understand it,” and the team decides to launch anyway on the theory that the market will respond differently than the research predicted.
Sometimes this is correct. Usually it is not. The decision matrix exists to create accountability. When your threshold criteria say “kill” and your team says “launch,” make that override explicit and documented. Track the outcomes. Over time, you will learn whether your overrides add value or destroy it — and the data will make future decisions easier.
Adapting the Template by Use Case
The five-part framework is use-case agnostic. The dimensions, questions, and decision criteria shift based on what you are testing. Below are adaptation guidelines for the six most common concept testing scenarios.
Product Concept Testing
Primary dimensions: Relevance, Appeal, Purchase Intent
Discussion guide emphasis: Sections 4 (appeal with laddering) and 6 (relevance and need fit) get the most time. The central question is whether the product solves a problem that consumers recognize and care about.
Stimulus format: Written concept board with product description, key benefits, and usage scenario. Avoid pricing at this stage unless price is the core differentiator.
Decision matrix adjustment: Purchase intent carries more weight. A product concept with strong appeal but weak purchase intent is entertainment, not a product. Probe the barriers aggressively — if they are “I would need to try it first” rather than “I would never buy this,” the concept may be viable with the right trial strategy.
Packaging Design Testing
Primary dimensions: Appeal (first reaction), Uniqueness, Clarity
Discussion guide emphasis: Section 2 (first reaction) is everything. Packaging decisions happen in 2-3 seconds at shelf. Spend the most time understanding the instant reaction and what drives it.
Stimulus format: Monadic design is strongly recommended. Comparison anchoring distorts packaging reactions more than any other concept type. Show each design individually, then use a protomonadic reveal to test competitive shelf context.
Decision matrix adjustment: Uniqueness carries more weight than in other use cases. Packaging that is appealing but category-generic disappears on shelf. The qualitative evidence should specifically address shelf standout and brand communication at a glance.
Messaging and Positioning Testing
Primary dimensions: Clarity, Relevance, Appeal
Discussion guide emphasis: Sections 3 (comprehension) and 5 (clarity) are primary. Can consumers play back the message accurately? Does the positioning resonate with how they think about the category?
Stimulus format: Sequential monadic works well for messaging tests — you want direct comparison of how different messages land with the same consumer. Randomize order rigorously.
Decision matrix adjustment: Clarity is the gating dimension. A message that is appealing but unclear will not work in market. A message that is clear and relevant but not yet appealing can usually be refined — the strategic foundation is sound.
Ad Creative Testing
Primary dimensions: Appeal (emotional response), Clarity (message takeaway), Uniqueness (breakthrough potential)
Discussion guide emphasis: Section 2 (first reaction) and Section 4 (appeal) with a specific focus on emotional response. Add: “How did this make you feel?” with full laddering on the emotion itself.
Stimulus format: Monadic for early-stage creative evaluation. Comparative for finalist selection. Always test in the intended media context when possible — a social ad should be tested in a social media simulation, not as a standalone asset.
Decision matrix adjustment: Emotional resonance is the gating dimension. Creative that communicates clearly but generates no emotional response will not break through. The qualitative evidence should map the specific emotional pathway: what the consumer felt, what triggered it, and whether it connects to the brand.
Brand Naming Testing
Primary dimensions: Clarity (what the name communicates), Appeal (emotional/aesthetic reaction), Uniqueness (distinctiveness in category)
Discussion guide emphasis: Section 3 (comprehension) with naming-specific questions. “What comes to mind when you hear this name?” “What kind of product do you imagine?” “What kind of person would use something called this?”
Stimulus format: Sequential monadic with names presented in randomized order. Present names without additional context first, then with product/brand context, to measure how much work the name does on its own.
Decision matrix adjustment: Add a “risk” dimension. Names that generate negative associations, unintended meanings, or pronunciation difficulties in your target markets should be flagged regardless of appeal scores. Qualitative evidence is especially important here — a name that 80% like but 20% find offensive has a different risk profile than a name that 80% like and 20% find forgettable.
Pricing and Value Perception Testing
Primary dimensions: Relevance (at this price), Purchase Intent (at this price), Appeal (value perception)
Discussion guide emphasis: Section 8 (purchase intent and barriers) with extensive price probing. “At this price, what would you expect?” “Does the price match what you think this is worth?” “At what price would this be an obvious yes? At what price would you walk away?”
Stimulus format: Monadic with price included in the concept. Do not test price in isolation — test the concept at a specific price, because price perception is inseparable from product perception.
Decision matrix adjustment: Map purchase intent at multiple price points. The qualitative evidence should reveal the value framework consumers are using — are they comparing to category norms, to the specific benefit, or to the cost of their current solution? This determines pricing power.
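A small sketch of what that mapping might look like once interviews are coded. The price points, intent scores, and the five-point floor are all illustrative:

```python
# Hypothetical mapping of purchase intent across tested price points.
# Intent scores here would be derived from scenario specificity in the
# interviews, not from a rating question. All numbers are illustrative.

PRICE_CELLS = {          # price point -> mean intent signal (0-10)
    4.99: 8.2,
    6.99: 7.1,
    8.99: 4.3,
}

def walkaway_price(cells: dict[float, float], floor: float = 5.0) -> float:
    """Lowest tested price at which mean intent drops below `floor`."""
    failing = [p for p, intent in sorted(cells.items()) if intent < floor]
    return failing[0] if failing else float("inf")

# Intent holds through $6.99 and collapses at $8.99: the value frame
# breaks somewhere between those points, and the qualitative evidence
# should say which comparison consumers are making when it breaks.
print(walkaway_price(PRICE_CELLS))  # 8.99
```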
Concept Testing Timeline
A concept test without a timeline drifts. Stakeholders lose urgency, analysis stretches into weeks, and the market window the concept was designed for closes while the team debates findings. This timeline creates accountability with clear milestones and owners for each phase.
Standard 3-Week Timeline
| Phase | Timeline | Key Activities | Owner | Deliverable |
|---|---|---|---|---|
| Design | Week 1 (Days 1-5) | Define the decision statement, finalize audience criteria, prepare stimulus materials, build discussion guide, set threshold criteria for the decision matrix, align stakeholders on success definitions | Research Lead + Brand/Product Manager | Completed study design checklist, finalized discussion guide, approved stimulus materials, documented decision matrix thresholds |
| Recruit + Field | Week 2 (Days 6-12) | Launch recruitment, screen participants, conduct AI-moderated interviews (50-200 conversations in 48-72 hours), monitor quality and completion rates | Research Lead | Completed interviews with target sample size achieved across all segments |
| Analyze + Present | Week 3 (Days 13-18) | Code responses across evaluation dimensions, populate segment-level rubric, apply decision matrix, prepare findings deck, present to stakeholders, document go/refine/kill recommendation | Research Lead + Stakeholder Group | Findings presentation, populated decision matrix, documented recommendation with supporting evidence |
Accelerated Timeline (48-72 Hours)
For teams that need results within a single sprint, AI-moderated interviews compress the timeline:
| Phase | Timeline | Key Activities | Owner |
|---|---|---|---|
| Design | Day 1 (morning) | Finalize study design checklist, configure discussion guide, prepare stimulus, launch study | Research Lead |
| Field | Day 1-2 | AI-moderated interviews run in parallel — 50-200 conversations complete asynchronously | Automated (platform) |
| Analyze + Present | Day 2-3 | AI-assisted analysis generates rubric scores and evidence clusters; researcher reviews, refines, and populates decision matrix | Research Lead |
The accelerated timeline is appropriate when the concept decision has a hard deadline (board meeting, production commitment, campaign launch) and the team has run at least one study using the full 3-week timeline to calibrate their process.
Milestone Checkpoints
Three non-negotiable checkpoints prevent the most common timeline failures:
- End of Week 1: Design lock. The study design checklist is complete, the discussion guide is finalized, stimulus materials are approved, and the decision matrix thresholds are documented. No changes after this point — changing the study design mid-field invalidates your data.
- Mid-Week 2: Recruitment and quality check. Verify that incoming participants match your screening criteria, interview quality meets depth expectations, and sample composition is tracking toward segment targets. Flag and address issues before the field phase completes.
- End of Week 2: Data completeness. All interviews are complete, sample size targets are met across segments, and the data is ready for analysis. If targets are not met, decide whether to extend fieldwork (with a documented timeline impact) or analyze with the available sample (with documented limitations).
Stakeholder Mapping
Concept testing involves more stakeholders than most research — and each stakeholder uses the results differently. Mapping their roles before the study prevents the post-study debate about what the results mean and who gets to decide.
Who to Involve
| Stakeholder | Role in Concept Testing | When to Engage | How They Use Results |
|---|---|---|---|
| Innovation / R&D Team | Concept creators. They developed the idea and have the deepest understanding of its intent, technical feasibility, and strategic rationale. | Study design (to ensure the concept is represented accurately), findings review (to understand consumer reactions) | Refine the concept based on qualitative evidence. Understand which elements resonate and which need rework. Decide whether to iterate or move to the next concept. |
| Product Management | Decision owners for product concepts. They determine whether a concept moves to development, gets refined, or gets killed based on fit with the roadmap and resource availability. | Study design (to define the business decision), threshold setting (to define what “good enough” looks like), findings review and decision | Go/refine/kill decision. Roadmap prioritization. Resource allocation for concepts that move forward. |
| Marketing | Positioning and launch owners. They need to understand what resonates with consumers to build campaigns, messaging, and go-to-market strategies around concepts that move forward. | Findings review, post-decision planning | Messaging strategy built on the specific language and benefits consumers responded to. Launch positioning grounded in qualitative evidence. |
| Sales | Market-facing feedback. In B2B contexts, sales teams have direct insight into what prospects are asking for and what competitive alternatives they are evaluating. | Study design (to provide competitive context and customer feedback), findings review | Sales enablement materials grounded in concept test evidence. Competitive positioning informed by consumer reactions. |
| Executive Sponsors | Resource allocators. They approve the investment required to move concepts from testing to development and launch. | Threshold setting (to align on investment criteria), findings presentation | Investment decisions based on evidence rather than opinion. Portfolio-level view of concept pipeline strength. |
Responsibility Assignment
| Activity | Innovation/R&D | Product Manager | Marketing | Sales | Executive Sponsor |
|---|---|---|---|---|---|
| Develop concept and stimulus | Owns | Reviews | Consulted | Consulted | Informed |
| Define business decision | Consulted | Owns | Consulted | Consulted | Approves |
| Set decision matrix thresholds | Consulted | Owns | Consulted | Informed | Approves |
| Design discussion guide | Consulted | Consulted | Consulted | Informed | Informed |
| Review findings | Reviews | Owns interpretation | Reviews | Reviews | Reviews |
| Make go/refine/kill decision | Informed | Recommends | Consulted | Consulted | Decides |
| Act on results | Executes refinement | Owns roadmap change | Builds GTM plan | Updates pitch | Allocates resources |
The most important mapping is the decision right. Before the study runs, everyone should know who makes the final go/refine/kill call. If that is ambiguous, the decision matrix becomes a debate prompt rather than a decision tool.
How Do You Build Compounding Intelligence?
A single concept test using this template produces a clear go/refine/kill decision. That is valuable. But the real value emerges when you use the same template consistently across studies.
Why Consistency Matters
When every concept test uses the same five evaluation dimensions, the same discussion guide structure, and the same decision criteria framework, something changes: results become comparable across studies, across quarters, and across product lines. You stop asking “did this concept test well?” and start asking “how does this concept compare to every concept we have tested in the last two years?”
This transforms concept testing from a one-off project into a knowledge system. You learn what “strong appeal” actually looks like in your category. You calibrate your success thresholds based on real data about what predicts in-market performance. You identify patterns — maybe concepts that score high on relevance but moderate on uniqueness consistently outperform concepts with the opposite profile, which tells you something fundamental about how your consumers make decisions.
Cross-Study Pattern Recognition
After 10-20 studies using the same template, patterns emerge that no individual study could reveal:
- Which evaluation dimensions are the strongest predictors of in-market success for your category
- What language and framing consumers consistently respond to — and what they consistently reject
- How different segments diverge in their concept evaluation patterns, and whether those divergences are stable over time
- Whether your organization’s concept development process is improving — are later concepts scoring better on the dimensions that matter?
The Intelligence Hub
User Intuition’s Intelligence Hub is built for exactly this use case. Every study, every interview, every finding is searchable and cross-referenced. When you launch a new concept test, the platform surfaces relevant findings from previous studies automatically — so you are not starting from zero, you are building on everything your organization has already learned.
This is what we mean by intelligence that compounds. A concept test run in March becomes context for a concept test run in September. A finding about consumer reactions to sustainability messaging in your packaging study becomes a data point in your messaging strategy two quarters later. The template creates the structure. The platform creates the memory.
Getting Started
This template is designed to be used today. Download it, adapt it to your concept and category, and run the study.
If you want to run it at scale — 50-200 AI-moderated depth interviews with full laddering, automated analysis, and a populated decision matrix delivered in 48-72 hours at $20 per interview — start a study on User Intuition or book a walkthrough with our team.
The five-part framework works with any research methodology. It works better with a methodology that can actually execute depth conversations at scale — because the template is only as good as the data that fills it.