Human preference data for unverifiable tasks.
User Intuition produces behaviorally grounded preference datasets for AI systems whose performance depends on human judgment — taste, persuasion, design, consumer reasoning, professional intuition — not correctness.
- JSONL · Parquet · transcript bundles
- Behaviorally grounded
- Exclusive licensing available
Models trained on rubrics can't learn judgment.
Reinforcement learning with verifiable rewards has carried frontier models a long way through code, math, and structured reasoning. The training signal in those domains is cheap to produce and cheap to grade: a unit test passes or fails, a proof type-checks or it doesn't, an answer matches or it doesn't. The economics of that loop are the reason rubric-graded data is now a multi-billion-dollar market dominated by a small number of suppliers.
The same loop breaks for tasks where the answer lives in human judgment. A response can be technically correct and tonally wrong. A persuasive argument can be factually accurate and unmoving. A design recommendation can satisfy every stated constraint and still feel generic. Models trained only against rubrics learn to produce outputs that pass checks rather than outputs that resonate — sycophantic, consensus-seeking, confidently miscalibrated about what real humans think and do. Frontier labs are hitting this wall now as the gains from rubric-based training are exhausted. Enterprise agent teams have already hit it whenever an agent is technically correct and strategically off. There is no obvious supplier for the data layer that resolves it.
We've measured the inverse failure mode directly. In The Synthetic Mirage in Market Research, we ran 117 real voice interviews against 90 LLM-generated synthetic participants on the same protocol. Across three frontier models, the synthetic participants reproduced the language of qualitative research without reproducing the population variance that makes it useful — disengagement, refusal, lived friction, extreme outliers. The output looked legitimate. It did not carry the signal. Models imitate humans the same way they imitate experts: confidently and within distribution.
An integrated stack for human preference data.
User Intuition operates an integrated stack for producing human preference data on unverifiable tasks. Three components, each useful on its own, but designed to compound.
01
Latent-variable qualitative methodology
Most preference data captures stated answers to direct questions. We use techniques developed inside academic and commercial qualitative research — laddering, jobs-to-be-done elicitation, the iceberg model — to extract the constructs that actually drive judgment: motivations, emotional drivers, identity claims, anticipated regret, perceived tradeoffs. The output is structured codes against a defined ontology, not free-text impressions.
02
AI-moderated voice infrastructure
Depth interviews historically scaled poorly: a senior moderator could run six to ten in a week, transcripts were uneven, coding was inconsistent. Our voice-moderated platform runs hundreds to thousands of interviews in parallel against a shared protocol, with consistent probing depth and uniform downstream coding. Voice features — latency, hesitation, prosody, certainty — are extracted at the interview layer and surfaced as ontology features. Audio itself does not ship.
03
Behavioral grounding
Stated preference and revealed preference diverge, often substantially. A model trained only on what people say predicts what people say — not what they choose, buy, prescribe, switch from, or stay with. Every dataset we license carries one or more behavioral-grounding mechanisms tying interview responses to observed behavior at the participant level. Details follow below.
Four mechanisms, used in combination.
Behavioral grounding is the hardest of the three components to deliver and the one that most differentiates the dataset.
01
Behavioral co-registration
Available when the licensee provides the audience and the behavioral source. The most common configuration today is the licensee's own customer base — Shopify merchants connecting their store data, app makers connecting product analytics, brand teams connecting CRM data. User Intuition runs interviews with those customers under additional consent terms; behavioral data flows from the licensee's system at the participant level. The dataset carries paired records: interview transcript and the behavioral truth of what the same person did, before and after the interview.
02
Longitudinal recontact
Participants are interviewed about an intended behavior, then recontacted 30 / 60 / 90 days later to capture what actually happened. The dataset carries stated-then-revealed pairs across decision contexts. Available where the licensee provides the audience or the participant has been onboarded with explicit recontact consent.
03
Forced-choice and willingness-to-pay
Inside the interview, stated-preference questions are replaced with forced tradeoffs that surface what the participant actually values when constrained. Conjoint-style modules, willingness-to-pay anchoring, scarcity framing. Less rigorous than co-registration but lifts signal materially over open-ended preference questions, and runs at scale because it's instrumentation rather than data partnership.
04
Voice-derived implicit signal
Voice features — response latency, hesitation, prosody, certainty markers — are extracted during the interview and encoded as ontology fields on each utterance and code. Used as a confidence layer on stated answers, not a primary grounding source. Audio is processed at extraction time; only the derived features ship.
One illustrative record.
The fastest way to understand what we ship is to look at one record. Fields, ontology, and naming represent the production schema; participant data is fabricated. Production format is JSONL (one record per line); pretty-printed here for readability.
{
"record_id": "ui-2026-cpg-laundry-0247",
"study": {
"study_id": "study_laundry_brand_switching_2026q1",
"domain": "cpg.household_care.laundry",
"protocol_version": "v3.2",
"fielded_window": ["2026-01-08", "2026-01-22"]
},
"participant": {
"panel_member_id": "anon_3f9d8e",
"screener": {
"age_bucket": "35-44",
"household_income_bucket": "$75k-$100k",
"household_composition": "two_adults_two_children",
"region_us": "midwest",
"primary_grocery_purchaser": true,
"category_spend_monthly_usd_bucket": "$15-$25"
},
"consent": {
"training_use": true,
"evaluation_use": true,
"synthetic_generation_use": true,
"redistribution_terms": "licensee_internal_only",
"consent_form_id": "ui_consent_v4_2026"
}
},
"interview": {
"modality": "voice",
"language": "en-US",
"duration_seconds": 1842,
"moderator": "ai",
"interview_guide_id": "guide_laundry_v3.2"
},
"transcript": [
{
"t": 412.3,
"speaker": "moderator",
"utterance": "Walk me through the last time you switched laundry detergent brands."
},
{
"t": 419.8,
"speaker": "participant",
"utterance": "Um, I think it was last summer? We were on Tide for a long time, and I'd been wanting to switch to something more sustainable.",
"implicit_signal": {
"latency_ms": 940,
"hesitation_markers": ["um"],
"certainty_score": 0.42
}
},
{
"t": 436.4,
"speaker": "participant",
"utterance": "Less plastic packaging. The detergent sheets — Earth Breeze, that kind of thing. I'd been seeing them on Instagram.",
"implicit_signal": {
"latency_ms": 180,
"hesitation_markers": [],
"certainty_score": 0.81,
"prosody_emphasis": "Earth Breeze"
}
},
{
"t": 462.0,
"speaker": "participant",
"utterance": "I tried it for like a month. It was fine but it was — yeah, I went back to Tide eventually.",
"implicit_signal": {
"latency_ms": 380,
"hesitation_markers": ["like", "yeah"],
"certainty_score": 0.55
}
}
],
"ontology": {
"schema_version": "ui_ontology_v2.1",
"jtbd": "keep household clothes clean and fresh within budget without spending mental effort on the decision",
"decision_drivers": [
{ "code": "price_per_load", "weight_stated": 0.34, "weight_revealed": 0.71 },
{ "code": "cleaning_efficacy_perception", "weight_stated": 0.61, "weight_revealed": 0.58 },
{ "code": "sustainability_signal", "weight_stated": 0.62, "weight_revealed": 0.12 },
{ "code": "brand_familiarity", "weight_stated": 0.18, "weight_revealed": 0.49 }
],
"latent_constructs": {
"identity_sustainable_consumer": 0.74,
"trust_in_legacy_brand": 0.66,
"price_sensitivity_household_goods": 0.69,
"anticipated_regret_efficacy": 0.58
},
"forced_choice": [
{
"task_id": "fc_sustainability_vs_price",
"options": ["sustainable_brand_at_$0.32_per_load", "legacy_brand_at_$0.18_per_load"],
"selected": "legacy_brand_at_$0.18_per_load",
"elapsed_ms": 4200
},
{
"task_id": "wtp_sustainability_premium",
"anchor_usd": 12.99,
"max_acceptable_usd": 14.49,
"premium_accepted_pct": 11.5
}
]
},
"behavioral_co_registration": {
"source": "licensee_audience_shopify",
"consent_scope": "ecommerce_event_data_24mo_lookback",
"window_start": "2024-01-01",
"window_end": "2026-01-22",
"category_events": [
{ "ts": "2025-04-12", "event": "purchase", "sku": "tide_pods_72ct", "channel": "instacart", "price_usd": 23.99 },
{ "ts": "2025-05-28", "event": "view_pdp", "sku": "earth_breeze_eco_sheets_30ct", "channel": "amazon", "referrer": "instagram_ad" },
{ "ts": "2025-05-29", "event": "add_to_cart", "sku": "earth_breeze_eco_sheets_30ct", "channel": "amazon" },
{ "ts": "2025-06-02", "event": "purchase", "sku": "earth_breeze_eco_sheets_30ct", "channel": "amazon", "price_usd": 14.99 },
{ "ts": "2025-07-14", "event": "purchase", "sku": "tide_pods_72ct", "channel": "target_dotcom", "price_usd": 21.49 },
{ "ts": "2025-09-08", "event": "view_pdp", "sku": "earth_breeze_eco_sheets_60ct", "channel": "amazon" },
{ "ts": "2025-12-04", "event": "purchase", "sku": "tide_pods_72ct", "channel": "instacart", "price_usd": 22.99 }
],
"stated_vs_revealed": {
"stated_primary_driver": "sustainability_signal",
"revealed_primary_driver": "brand_familiarity_at_lower_price_per_load",
"stated_revealed_alignment_score": 0.21
}
},
"longitudinal_recontact": {
"wave_2_scheduled": "2026-04-22",
"wave_2_outcome": null,
"stated_intent_at_wave_1": "considering_switch_to_eco_brand_again_when_kids_older"
}
}
A few things worth noticing.
The stated_revealed_alignment_score of 0.21. The participant rated sustainability as a dominant decision driver in the interview. Her purchase history over the same window shows roughly 82% legacy-brand share by spend. The gap is the training signal.
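The legacy-brand spend share can be reproduced from the event stream itself. A minimal sketch, using the field names from the illustrative record (`event`, `sku`, `price_usd`); note the production alignment score is a separate metric whose method is not shown here:

```python
# Legacy-brand share by spend, computed from the sample record's
# category_events stream. Spend share is an illustrative proxy,
# not the shipped stated_revealed_alignment_score.
events = [
    {"event": "purchase", "sku": "tide_pods_72ct", "price_usd": 23.99},
    {"event": "view_pdp", "sku": "earth_breeze_eco_sheets_30ct"},
    {"event": "add_to_cart", "sku": "earth_breeze_eco_sheets_30ct"},
    {"event": "purchase", "sku": "earth_breeze_eco_sheets_30ct", "price_usd": 14.99},
    {"event": "purchase", "sku": "tide_pods_72ct", "price_usd": 21.49},
    {"event": "view_pdp", "sku": "earth_breeze_eco_sheets_60ct"},
    {"event": "purchase", "sku": "tide_pods_72ct", "price_usd": 22.99},
]

# Only purchases carry spend; views and add-to-carts are funnel signal.
purchases = [e for e in events if e["event"] == "purchase"]
total_spend = sum(e["price_usd"] for e in purchases)
legacy_spend = sum(e["price_usd"] for e in purchases if e["sku"].startswith("tide"))
legacy_share = legacy_spend / total_spend
print(f"{legacy_share:.2f}")  # ~0.82
```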
The forced_choice block. Inside the interview, two structured tradeoffs replaced open-ended preference questions. She said sustainability mattered. When forced to choose at a 78% price premium, she picked the legacy brand. Forced-choice instrumentation runs inside every interview and is data-source-independent — it produces stated-revealed pairs in UI-panel mode where co-registration isn't available, and supplements grounding when it is.
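Both forced-choice figures fall out of the sample record's numbers directly. A quick check, assuming nothing beyond the values shown in the `forced_choice` block:

```python
# Arithmetic behind the two forced-choice figures in the sample record.
# Values come straight from the illustrative JSON.

# Willingness-to-pay: premium accepted over the $12.99 anchor.
anchor_usd = 12.99
max_acceptable_usd = 14.49
premium_accepted_pct = (max_acceptable_usd - anchor_usd) / anchor_usd * 100
print(round(premium_accepted_pct, 1))  # 11.5

# Forced choice: per-load price premium of the sustainable option.
sustainable_per_load = 0.32
legacy_per_load = 0.18
premium_pct = (sustainable_per_load - legacy_per_load) / legacy_per_load * 100
print(round(premium_pct))  # 78
```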
The implicit_signal layer on each utterance. Latency, hesitation markers, certainty scores. Audio itself doesn't ship; the features derived from it do. Models can use these as a confidence layer on stated answers — high-certainty answers carry different weight than hesitant ones.
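One way a consumer of the data might use that layer, sketched under assumptions: the certainty-weighted average below is a hypothetical aggregation scheme, not a documented pipeline step; only the `certainty_score` field comes from the record.

```python
# Illustrative use of implicit_signal as a confidence layer:
# down-weight hesitant utterances when aggregating a stated preference.
# stated_value here is a hypothetical 0/1 endorsement per utterance;
# certainty_score values are taken from the sample record.
utterances = [
    {"code": "sustainability_signal", "stated_value": 1.0, "certainty_score": 0.42},
    {"code": "sustainability_signal", "stated_value": 1.0, "certainty_score": 0.81},
    {"code": "sustainability_signal", "stated_value": 0.0, "certainty_score": 0.55},
]

total_w = sum(u["certainty_score"] for u in utterances)
weighted = sum(u["stated_value"] * u["certainty_score"] for u in utterances) / total_w
print(round(weighted, 2))  # 0.69 — lower than the unweighted 2/3 would suggest
```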
The consent block. Every record carries the consent scope it was collected under: training, evaluation, synthetic-generation, redistribution terms. Studies commissioned for human-preference-data licensing carry consent terms designed for that use; the legacy interview corpus is not retroactively re-licensed.
The category_events stream. In licensee-audience mode, event-level co-registration is available — page views, add-to-cart, purchases, abandons, returns — paired to interview transcripts at the participant level. The arc here (Instagram ad → PDP view → add-to-cart → trial purchase → revert → reconsidered three months later) is what makes co-registered data more useful than purchase-only panels: the model sees the full consideration funnel, not just the conversion.
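Since the production format is JSONL (one record per line), the walkthrough above maps directly onto a few lines of consuming code. A minimal sketch, using field names from the illustrative record; a production schema may carry additional fields:

```python
import json

# One JSONL line, trimmed to the fields this sketch touches.
raw = ('{"record_id": "ui-2026-cpg-laundry-0247", '
       '"ontology": {"decision_drivers": ['
       '{"code": "sustainability_signal", '
       '"weight_stated": 0.62, "weight_revealed": 0.12}]}}')

record = json.loads(raw)  # in production: one json.loads per file line

# The stated-revealed gap per decision driver is the headline signal.
gaps = {
    d["code"]: round(d["weight_stated"] - d["weight_revealed"], 2)
    for d in record["ontology"]["decision_drivers"]
}
print(gaps)  # {'sustainability_signal': 0.5}
```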
Three buyer segments, one dataset.
The same record serves each segment, weighting different fields.
01
Frontier labs
Training and evaluation substrate for post-training on tasks where verifiable rewards don't apply. Reward modeling for unverifiable preferences. Evaluation sets where rubric-based graders saturate. Calibration data for sycophancy and consensus drift. The latent-construct ontology and stated-vs-revealed pairs are what matter; events and longitudinal recontact extend training signal beyond what comparison-based preference data can capture alone.
02
Enterprise agent teams
Fine-tuning and evaluation data for agents whose outputs are judged on taste, persuasion, or consumer/professional resonance rather than technical correctness. Sales agents that need to read a buyer's actual decision drivers. Concierge and shopping agents that need to ground recommendations in revealed preference, not stated. Agents in domains where being technically right and strategically off is the dominant failure mode.
03
Synthetic data & voice teams
A grounded foundation for generating scaled synthetic populations, voices, and reasoning traces. Synthetic outputs drift toward model priors when seeded only on other synthetic data. Seeding generation on real human cognition with paired behavioral data anchors the synthetic distribution to observed human variance — the failure mode documented in The Synthetic Mirage.
What you license, how it's licensed.
First licensable corpus. Consumer purchase reasoning, CPG, and brand persuasion. Studies are commissioned on demand for new domains; existing infrastructure runs in 50+ languages and across professional and consumer panels.
Procurement modes. Two configurations determine which behavioral grounding mechanisms apply. UI-panel mode — User Intuition sources participants from its own panel — produces structured stated-preference data with forced-choice instrumentation and voice-derived implicit signal, useful for many evaluation and reward-modeling needs. Licensee-audience mode — the buyer provides the audience via Shopify, CRM, or app analytics integration — adds event-level co-registration and longitudinal recontact for buyers who need stated-revealed pairs and event-level signal alongside the stated-preference layer.
- Data formats
- JSONL, Parquet, transcript bundles. Other formats on request.
- Allowed uses
- Fine-tuning, evaluation, reward modeling, synthetic data generation, retrieval-augmented generation, internal benchmarking, derivative datasets, model distillation. Use is bounded by the consent block on each record.
- Licensing models
- Exclusive, semi-exclusive, or syndicated. Pricing scales with corpus size, exclusivity, and grounding depth (number and type of behavioral mechanisms applied).
- Retention & provenance
- User Intuition retains ownership of the underlying corpus. Licensees receive structured derivative data under the terms of the agreement. Provenance is carried at the record level: study ID, protocol version, ontology schema version, consent scope.
- Audio & video
- Transcripts ship; audio and video do not. Voice-derived features are extracted at processing time and surfaced as ontology fields on the transcript. Buyers requiring biometric audio for specific use cases can discuss separately under additional consent terms.
- Sample data
- Anonymized illustrative samples available on inquiry.
Who's running this.
User Intuition runs a self-serve research SaaS used by product, customer-experience, and insights teams to commission AI-moderated voice interviews and the structured outputs derived from them. The same platform infrastructure — including the Human Signal MCP, our Model Context Protocol layer for grounding agents in real human cognition — produces the human preference datasets described here.
The team is led by the founder and CEO (ex-McKinsey, Harvard MBA, BS Electrical Engineering, Yale), and the broader User Intuition Research Team — the methodologists, engineers, and panel operators who design protocols, build the ontology, and run the studies.
Methodology, infrastructure, and team have produced tens of thousands of structured human-cognition interviews across consumer and professional domains since the platform was built.
Design partners.
The first generation of human preference datasets for unverifiable tasks is being co-designed with a small number of frontier-lab post-training teams, agent companies, and synthetic-data teams. Design partners shape protocol, ontology, and grounding configuration. In exchange: input rights on study scope, preferential licensing on the resulting corpus, and direct work with the User Intuition Research Team. Limited spots in 2026.
Apply to be a design partner