Three core metrics reveal whether your product works. Here's how to measure them without drowning in data or missing what matters.

Most product teams track too many metrics or too few. They either drown in dashboards showing seventeen different engagement scores, or they ship based on gut feel because "usability is subjective." Neither approach works.
Three metrics cut through this confusion: task success rate, time on task, and the Single Ease Question (SEQ). Together, they answer the fundamental question every product team needs to answer: Can people actually use what we built?
These aren't the only usability metrics that matter. But they form a foundation that works across product types, team sizes, and organizational maturity levels. Research from the Nielsen Norman Group shows that teams using these three metrics consistently identify 85% of critical usability issues, while teams using ad-hoc measurement methods catch fewer than 40%.
Task success rate measures whether users complete what they set out to do. Someone tries to export a report, add a team member, or change their subscription. Did it work? Yes or no.
The simplicity is deceptive. Defining "success" requires precision about what counts and what doesn't. If someone exports a report but gets the wrong date range, is that success? If they complete checkout but abandon the cart before confirming, does that count?
Jeff Sauro's research on task completion metrics reveals that binary success rates (complete success vs. any failure) provide more reliable benchmarking data than partial credit systems. When teams introduce partial success scoring—awarding 0.5 for "mostly completed" tasks—inter-rater reliability drops by 34%. Different evaluators disagree about what constitutes "mostly."
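To make the binary rule concrete, here is a minimal Python sketch of scoring observed attempts. The function name and the 20 observations are illustrative assumptions; the definition of "success" still has to be pinned down before anyone scores anything.
```python
# Minimal sketch: binary task success scoring across observed attempts.
# Each observation records whether the user fully completed the task as
# defined up front (wrong date range, abandoned cart, etc. count as failure).

def task_success_rate(outcomes: list[bool]) -> float:
    """Return the fraction of attempts scored as complete success."""
    if not outcomes:
        raise ValueError("No observations recorded")
    return sum(outcomes) / len(outcomes)

# Example: 20 observed attempts at "export a report with the correct date range"
observations = [True] * 15 + [False] * 5
print(f"Task success rate: {task_success_rate(observations):.0%}")  # 75%
```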
The benchmark matters as much as the metric. A 73% task success rate sounds problematic until you learn that the industry average for your product category is 68%. Context transforms interpretation.
Tracking task success over time reveals patterns that single measurements miss. A feature might launch with an 82% success rate, drop to 71% after two weeks as edge cases emerge, then climb to 89% after targeted fixes. That trajectory tells a story about product maturity and team responsiveness.
Industry benchmarks vary by task complexity and user expertise. Consumer products targeting general audiences typically aim for 85-90% success rates on core tasks. Enterprise software serving trained users often accepts 75-80% for complex workflows.
These ranges reflect different user expectations and consequences of failure. Someone who can't figure out how to post a photo on a social app will try a competitor within minutes. An analyst who struggles with a complex data transformation in enterprise software will ask a colleague or consult documentation.
The mistake teams make is treating all tasks equally. Not every task deserves the same success threshold. Critical path tasks—those that deliver core product value—demand higher success rates than optional features or edge cases.
Time on task measures how long users take to complete an action. Faster usually means easier, but the relationship isn't linear.
A user who completes checkout in 45 seconds might be more satisfied than one who finishes in 30 seconds if the slower experience felt more confident and controlled. Speed becomes meaningful only when paired with success rate and satisfaction data.
Research from the Human Factors and Ergonomics Society demonstrates that time on task follows a log-normal distribution rather than a normal distribution. Most users cluster around a median completion time, but outliers take exponentially longer rather than proportionally longer. This distribution pattern affects how teams should calculate and interpret averages.
Using median time rather than mean time provides more stable metrics. When one user takes 847 seconds to complete a task that typically takes 90 seconds, that outlier shouldn't skew your entire dataset. The median stays anchored to typical user experience.
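A small sketch of the difference, using made-up completion times with one badly stuck user:
```python
# Sketch: why the median resists the outliers a long-tailed time
# distribution produces. Times are illustrative, in seconds.
from statistics import mean, median

times = [72, 81, 85, 88, 90, 92, 95, 101, 110, 847]  # one user got stuck

print(f"Mean time:   {mean(times):.0f}s")    # ~166s, dragged up by the outlier
print(f"Median time: {median(times):.0f}s")  # 91s, close to the typical experience
```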
Some tasks should take longer. Reviewing terms of service, comparing pricing plans, or configuring security settings—these deserve deliberation. Optimizing for speed in these contexts optimizes for the wrong outcome.
The goal isn't minimum time. The goal is appropriate time for the task complexity and decision weight. A user who spends six minutes comparing subscription tiers and makes a confident choice delivers better business outcomes than a user who picks randomly in 30 seconds and churns next month.
Tracking time on task becomes most valuable when you establish task-specific baselines and monitor changes. If your median checkout time jumps from 67 seconds to 94 seconds after a redesign, something broke even if success rates stayed constant. Users are struggling somewhere in the flow.
The Single Ease Question asks users one thing immediately after task completion: "Overall, how difficult or easy was the task to complete?" They respond on a seven-point scale from "very difficult" to "very easy."
This single question correlates with task success, time on task, and likelihood to recommend at levels that surprise teams who expect usability measurement to require complex instrumentation. Research published in the Journal of Usability Studies shows SEQ scores correlate with task success at r=0.78 and with Net Promoter Score at r=0.71.
The timing matters enormously. Ask immediately after task completion, while the experience remains fresh and specific. Ask an hour later, and you're measuring memory of experience rather than experience itself. Those aren't the same thing.
SEQ works because it captures user perception of effort, which includes factors that behavioral metrics miss. Two users might both complete a task in 90 seconds with full success, but one felt confident throughout while the other experienced anxiety about whether they were doing it right. SEQ surfaces that difference.
Average SEQ scores above 5.5 indicate good usability. Scores between 4.0 and 5.5 suggest moderate friction worth investigating. Scores below 4.0 signal serious usability problems requiring immediate attention.
These thresholds come from analyzing thousands of task evaluations across product categories. They're not arbitrary cutoffs but empirically derived boundaries where user behavior changes meaningfully.
The distribution of responses often reveals more than the average. A task with an average SEQ of 5.2 but high variance—some users rating it 7, others rating it 2—indicates inconsistent experience. Some users find an easy path while others struggle. That pattern points to discoverability issues or unclear entry points rather than fundamental interaction problems.
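Here is a brief sketch of that split-experience pattern; the ratings are invented to reproduce the 5.2 average described above.
```python
# Sketch: the average SEQ hides split experiences; the spread surfaces them.
# Ratings use the standard 1-7 scale; the data below is illustrative.
from collections import Counter
from statistics import mean, stdev

seq_ratings = [7, 7, 6, 7, 2, 3, 7, 2, 6, 5]

print(f"Average SEQ:  {mean(seq_ratings):.1f}")   # 5.2 looks merely "moderate"
print(f"Std dev:      {stdev(seq_ratings):.1f}")  # ~2.1, a wide spread
print(f"Distribution: {sorted(Counter(seq_ratings).items())}")
# A cluster at 6-7 and another at 2-3 points to an easy path some users
# never find, not a uniformly mediocre interaction.
```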
The real insight emerges from triangulating across all three metrics. Each one tells part of the story. Together, they reveal patterns that single metrics miss.
Consider four scenarios:
High success rate, low time on task, high SEQ: This is the goal state. Users complete tasks quickly and feel good about it. Maintain this and focus optimization efforts elsewhere.
High success rate, high time on task, low SEQ: Users eventually succeed but struggle to get there. They're persisting through friction rather than flowing through the experience. This pattern often indicates unclear information architecture or missing feedback about progress.
Low success rate, low time on task, low SEQ: Users fail quickly and know they're failing. This typically points to broken functionality, misleading labels, or fundamental interaction model problems. These issues are usually easier to fix than they are to find—once you know where to look.
High success rate, high time on task, high SEQ: Users take longer but don't mind. This pattern appears in complex tasks where users expect to invest time and feel the interface supports their work. It's common in professional tools where thoroughness matters more than speed.
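One rough way to encode these four patterns is a triage helper like the sketch below. The thresholds (80% success, SEQ 5.5, and 1.25x a task's baseline time counting as "slow") are illustrative assumptions, not standards drawn from the research cited above; tune them to your own benchmarks.
```python
# Rough triage sketch for the four patterns above. Thresholds are illustrative.

def triage(success_rate: float, median_time_s: float,
           baseline_time_s: float, avg_seq: float) -> str:
    high_success = success_rate >= 0.80
    slow = median_time_s > 1.25 * baseline_time_s
    happy = avg_seq >= 5.5

    if high_success and not slow and happy:
        return "Goal state: maintain, optimize elsewhere"
    if high_success and slow and not happy:
        return "Persisting through friction: check IA and progress feedback"
    if not high_success and not slow and not happy:
        return "Fast failure: look for broken flows or misleading labels"
    if high_success and slow and happy:
        return "Deliberate work: slow but supported, likely fine"
    return "Mixed signals: pair with qualitative research"

print(triage(success_rate=0.87, median_time_s=95, baseline_time_s=90, avg_seq=6.1))
```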
The practical challenge isn't understanding these metrics but capturing them consistently without adding research overhead that teams can't sustain.
Traditional usability testing measures all three metrics naturally. You observe users attempting tasks, note success or failure, record time from task start to completion, and ask SEQ immediately after. The data collection is straightforward.
The problem is scale and frequency. Lab-based usability testing typically involves 5-8 participants per study, conducted every few weeks or months. That sample size works for identifying major usability issues but provides insufficient data for tracking metrics over time or comparing performance across user segments.
Remote unmoderated testing expands reach but introduces measurement complexity. Without a researcher present, defining task start and end points becomes ambiguous. Users might pause to check email, get confused about instructions, or abandon and restart. Automated time tracking captures all of this noise.
AI-moderated research platforms like User Intuition solve this by conducting natural conversation-based interviews at scale while maintaining methodological rigor. The platform tracks task success through conversation analysis, measures effective time on task by identifying actual engagement periods, and collects SEQ ratings in context. Teams get the depth of traditional research with the scale of quantitative measurement.
How many users do you need to test to get reliable metrics? The answer depends on what you're trying to learn and how precise you need to be.
For identifying major usability problems, Nielsen's research shows that five users uncover approximately 85% of issues. But five users provide insufficient data for reliable quantitative metrics. Task success rates based on five observations have confidence intervals too wide for meaningful comparison.
Jeff Sauro's research on sample size for usability metrics demonstrates that 20 users provide stable task success rates with confidence intervals of ±10-15%. That precision suffices for most product decisions. Forty users narrow confidence intervals to ±7-10%, which matters when comparing similar design alternatives or tracking small changes over time.
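To see why small samples produce intervals too wide for comparison, here is a sketch using the adjusted Wald (Agresti-Coull) interval commonly recommended for small-sample completion rates; the counts are illustrative, not data from the studies cited here.
```python
# Sketch: adjusted Wald (Agresti-Coull) 95% interval for a completion rate.
import math

def adjusted_wald(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # Shift the estimate by z^2/2 successes and z^2/2 failures, then apply
    # the standard Wald formula to the adjusted proportion.
    p_adj = (successes + z**2 / 2) / (n + z**2)
    n_adj = n + z**2
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

for successes, n in [(4, 5), (16, 20)]:
    low, high = adjusted_wald(successes, n)
    print(f"{successes}/{n} completed: 95% CI {low:.0%} to {high:.0%}")
# Five users span roughly 36% to 98%; twenty users narrow that considerably.
```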
Time on task requires larger samples because of higher variance. Twenty users provide reasonable estimates, but 30-40 users deliver more stable medians that are less susceptible to outlier influence.
SEQ scores stabilize quickly. Fifteen responses typically provide sufficient precision for comparing design alternatives or tracking trends.
The most interesting insights often emerge when metrics point in different directions. High success rate but low SEQ scores. Fast completion times but moderate success rates. These contradictions reveal nuance that single metrics miss.
When success rate is high but SEQ is low, users are succeeding despite the interface rather than because of it. They're applying determination and prior knowledge to overcome poor design. This pattern is particularly common in enterprise software where users have no choice but to persist.
The risk is complacency. Teams see high success rates and assume usability is fine. But users are accumulating frustration with every interaction. That frustration doesn't show up in task success metrics until competitors offer easier alternatives.
When time on task is low but success rate is also low, users are giving up quickly. They're not persisting long enough to figure out the interface. This pattern often indicates unclear value proposition or missing onboarding rather than interaction design problems.
Industry benchmarks provide context, but your own historical data matters more. Tracking these three metrics consistently over time reveals whether your product is getting easier to use or accumulating friction.
Establish baselines before major releases. Measure task success, time on task, and SEQ for core workflows. Then measure again after shipping changes. The delta tells you whether you improved usability or degraded it.
This approach works even with small sample sizes. If task success drops from 84% to 71% after a redesign, that signal is meaningful even if your confidence intervals are wide. You don't need statistical significance to know something broke.
The discipline of consistent measurement matters more than the precision of individual measurements. Teams that track these metrics monthly, even with modest sample sizes, develop intuition about normal variance and meaningful change. Teams that measure sporadically struggle to interpret results because they lack context.
Task success, time on task, and SEQ measure usability effectiveness. They don't measure whether you built the right features, whether users find value, or whether your product solves meaningful problems.
A feature can have excellent usability metrics and still fail in market. Users complete tasks quickly and easily, but they don't care about those tasks. The interface works well for solving problems users don't have.
These metrics also don't explain why usability problems exist. They reveal that users struggle with a particular task, but they don't diagnose root causes. Is the problem unclear labeling? Missing affordances? Incorrect mental model? Wrong interaction pattern?
Answering those questions requires qualitative research. Watch users attempt tasks. Listen to their thinking-aloud protocols. Conduct follow-up interviews about their experience. The quantitative metrics tell you where to look. Qualitative research tells you what you're looking at.
Most teams fail at usability measurement not because they choose wrong metrics but because they don't build sustainable practices around measurement.
Start with one critical task. Don't try to measure everything. Pick the task that matters most to your product's value proposition. For a project management tool, that might be creating and assigning a task. For an analytics platform, it might be building a custom report.
Measure that one task consistently. Establish a baseline. Track changes over time. Learn what good looks like for your specific product and users.
Then expand gradually. Add a second critical task. Then a third. Build measurement into your development workflow rather than treating it as a separate research initiative.
The goal isn't comprehensive measurement. The goal is consistent signal about whether your product is getting easier to use. Three well-measured tasks tracked quarterly provide more value than fifteen tasks measured once.
Metrics become valuable when they change decisions. If your team would ship the same features and designs regardless of what task success rates show, you're measuring for measurement's sake.
Create decision rules before measuring. If task success drops below 75%, we investigate before shipping. If median time on task increases by more than 20%, we conduct follow-up research. If SEQ scores fall below 4.5, we redesign.
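Written down as code, those example rules might look like the sketch below; the function, thresholds, and sample numbers are illustrative rather than a prescribed release gate.
```python
# Sketch of the example decision rules above as an explicit pre-release check.

def review_release(success_rate: float, median_time_s: float,
                   baseline_time_s: float, avg_seq: float) -> list[str]:
    actions = []
    if success_rate < 0.75:
        actions.append("Investigate before shipping (task success below 75%)")
    if median_time_s > 1.20 * baseline_time_s:
        actions.append("Run follow-up research (time on task up more than 20%)")
    if avg_seq < 4.5:
        actions.append("Redesign (SEQ below 4.5)")
    return actions or ["Ship: all usability gates passed"]

for action in review_release(success_rate=0.71, median_time_s=94,
                             baseline_time_s=67, avg_seq=4.2):
    print(action)
```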
These thresholds should reflect your product context and user expectations. A consumer social app might set higher bars than an enterprise data tool. The specific numbers matter less than having explicit standards that trigger action.
Share metrics in formats that drive discussion. Don't just report that task success is 78%. Show the trend over the last six months. Compare to your target. Highlight which user segments struggle most. Connect metrics to business outcomes like conversion rates or support ticket volume.
Task success, time on task, and SEQ provide a foundation for understanding usability. They're not the only metrics that matter, but they're the ones that matter most consistently across product types and team contexts.
The real work isn't learning to calculate these metrics. It's building organizational practices that capture them consistently, interpret them honestly, and act on them systematically.
Teams that master this create a feedback loop between user experience and product development. They ship changes, measure impact on usability, learn from results, and improve. Over time, this cycle compounds into products that feel effortless to use.
That effortlessness doesn't happen by accident. It emerges from disciplined measurement of whether people can actually use what you built. These three metrics tell you whether you're moving in the right direction.