GUIDE

BUILDING GOLD-STANDARD EVALUATION SETS

Why most clinical AI benchmarks are broken — and what it takes to build evaluation datasets that actually measure clinical reasoning.

01

CLINICAL AI HAS AN EVALUATION CRISIS

More than 53 medical LLM benchmarks now exist. A systematic review found issues across nearly all of them: disconnection from clinical practice, data contamination, over-reliance on multiple-choice formats, and neglect of safety-critical dimensions such as robustness and uncertainty awareness.

The result is an “evaluation illusion” — models appear to perform well on benchmarks that don't reflect the tasks they'll actually face in clinical settings. Discrepancies in data choices, task design, and metrics produce misleading conclusions about real-world efficacy.

For teams building clinical AI, this means you cannot rely on public benchmarks to tell you whether your model is ready. You need evaluation sets designed around your specific clinical domain, sourced from real encounters, and validated by physicians who understand the decision being tested.

02

THE BENCHMARK SATURATION PROBLEM

Models score roughly 90% on board-style MCQ benchmarks such as MedQA. The same models achieve only 35.7% accuracy on LiveClin, a continuously updated benchmark built from real-world case reports.

The gap isn't surprising. Board exams test knowledge retrieval — can the model recall the correct fact from a set of options? Clinical practice requires reasoning under uncertainty with incomplete information, ambiguous presentations, and competing plausible diagnoses. Most existing benchmarks evaluate the former while claiming to measure the latter.

The multiple-choice trap

Multiple-choice formats artificially constrain the problem space. A model can score well through elimination strategies and statistical associations without genuine clinical reasoning. Real clinical decisions don't come with four options — they require generating hypotheses from open-ended observations, then systematically narrowing.

Beyond simple model scaling

LiveClin's results revealed that simple model scaling no longer produces consistent improvements on real-world clinical tasks. Larger models don't reliably outperform smaller ones when the evaluation requires genuine reasoning rather than pattern matching. This suggests the bottleneck has shifted from model capacity to evaluation quality, and by extension, to the quality of the data used to train and evaluate models.

03

DATA CONTAMINATION

Static benchmarks built from published medical literature are increasingly contaminated — their questions and answers appear in LLM training corpora, inflating evaluation scores. This isn't hypothetical; it is a documented, systemic problem across medical AI evaluation.

How contamination happens

Most medical benchmarks draw from textbooks, board review questions, or published case reports. These same sources are ingested during LLM pre-training. When a model encounters a benchmark question it has seen (or seen a close paraphrase of) during training, it can reproduce the answer from memory rather than reasoning about it. The evaluation becomes a recall test, not a reasoning test.
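One practical screen for this failure mode is lexical overlap between benchmark items and candidate training sources. The sketch below is a minimal, illustrative version using word n-gram overlap; the function names, the 8-gram window, and the 0.3 threshold are all assumptions chosen for illustration, not a standard. Real contamination audits also check paraphrases, which this lexical check will miss.

```python
from typing import List, Set


def ngrams(text: str, n: int = 8) -> Set[tuple]:
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_score(item_text: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in a corpus document.
    High overlap suggests the item, or a near-copy, was seen in training."""
    grams = ngrams(item_text, n)
    if not grams:
        return 0.0
    return len(grams & ngrams(corpus_doc, n)) / len(grams)


def flag_contaminated(items: List[str], corpus_docs: List[str],
                      threshold: float = 0.3, n: int = 8) -> List[str]:
    """Return benchmark items whose text substantially appears in any source doc."""
    return [item for item in items
            if any(overlap_score(item, doc, n) >= threshold for doc in corpus_docs)]
```

An item lifted verbatim from a textbook scores near 1.0 against that source; an item authored fresh from a private encounter scores near 0.0 against everything public.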

The assumption you should make

Any evaluation set drawn from publicly available medical case reports, board questions, or academic literature should be assumed contaminated for current-generation LLMs. This includes datasets that were “private” at creation but have since been published, referenced in papers, or partially leaked through model-generated content.

What uncontaminated data looks like

Uncontaminated evaluation data comes from sources that never touch the public internet: de-identified internal clinical records, prospectively collected cases, or physician-authored scenarios written specifically for evaluation. This is expensive to produce and impossible to crowdsource — which is precisely why it has value.

04

TESTING REASONING, NOT RETRIEVAL

The gap between knowledge retrieval and clinical reasoning is the central challenge in evaluation design. A useful benchmark must distinguish between a model that knows the right answer and a model that can arrive at the right answer through valid clinical reasoning.

Multi-step inference

Clinical reasoning is inherently multi-step: observe findings, generate hypotheses, order tests, integrate new information, narrow the differential. Evaluation tasks should require this chain of inference — not just a single-hop from question to answer. Cases that can be answered correctly with a single association (symptom → diagnosis) are testing retrieval, not reasoning.
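One way to enforce this structurally is to represent each case as staged information release, so the model must commit to an action before the next findings are revealed. The schema below is a hypothetical sketch; the class names and fields are illustrative assumptions, not a fixed format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseStage:
    """One information-release step: new findings, then the action a competent
    clinician would be expected to take before seeing anything further."""
    findings: str          # e.g. "ECG shows ST elevation in leads II, III, aVF"
    expected_action: str   # e.g. "order troponin; activate cath lab protocol"


@dataclass
class MultiStepCase:
    """An evaluation case that forces chained inference: observe, hypothesize,
    act, integrate, narrow."""
    case_id: str
    presenting_complaint: str
    stages: List[CaseStage]
    final_diagnosis: str

    def is_single_hop(self) -> bool:
        """Cases with fewer than two stages can be answered by a single
        symptom-to-diagnosis association and should be reworked or excluded."""
        return len(self.stages) < 2
```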

Diagnostic ambiguity

The most informative evaluation cases are those where the correct answer depends on how evidence is weighed — cases with competing plausible diagnoses, incomplete workups, or findings that shift interpretation based on context. These cases test whether a model can reason through uncertainty rather than default to the most common association.

Evaluating the reasoning chain

Scoring only the final answer discards the most valuable signal. A model that reaches the correct diagnosis through incorrect reasoning is a liability; a model that reasons correctly but narrows to the wrong final answer may be one data point away from clinical utility. Physician-validated evaluation sets should include reference reasoning chains against which model reasoning can be compared — not just answer keys.
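Comparing a model's chain against a physician-authored reference chain can be sketched as ordered step matching. This is a deliberately crude version: the Jaccard similarity and the 0.5 threshold are placeholder assumptions, and a production grader would use a physician rubric or a calibrated judge model rather than lexical overlap.

```python
def step_similarity(a: str, b: str) -> float:
    """Crude lexical similarity between two reasoning steps (Jaccard on word sets)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def chain_score(model_steps: list, reference_steps: list,
                threshold: float = 0.5) -> float:
    """Fraction of reference reasoning steps matched, in order, by the model's
    chain. Rewards valid intermediate reasoning even when the final answer is
    wrong, and penalizes a correct answer reached through an invalid chain."""
    matched, start = 0, 0
    for ref in reference_steps:
        for j in range(start, len(model_steps)):
            if step_similarity(model_steps[j], ref) >= threshold:
                matched += 1
                start = j + 1  # enforce ordering: later refs must match later steps
                break
    return matched / len(reference_steps) if reference_steps else 0.0
```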

05

CONSTRUCTING EVALUATION SETS THAT WORK

Building a useful clinical evaluation set is fundamentally an annotation problem. It requires physicians who can construct clinically valid scenarios with verified reasoning chains and calibrated difficulty.

Source from real encounters

The strongest evaluation cases are derived from de-identified real clinical encounters — not textbook vignettes. Real cases carry the noise, ambiguity, and incompleteness that define actual clinical practice. Physician authors then structure these into evaluation tasks with defined reasoning expectations.

Calibrate difficulty

An evaluation set should span the difficulty spectrum: cases that any competent clinician would get right (floor validation), cases that require specialist knowledge (ceiling probing), and cases with genuine ambiguity (calibration testing). Without this range, you can't distinguish between models at different capability levels.
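A coverage check for this range can be as simple as counting cases per difficulty tier. The tier labels and the minimum of 25 cases per tier below are illustrative assumptions; the point is that the check runs automatically before any model is scored.

```python
from collections import Counter

# Hypothetical tier labels: floor validation, ceiling probing, calibration testing.
TIERS = ("floor", "specialist", "ambiguous")


def difficulty_gaps(cases: list, min_per_tier: int = 25) -> dict:
    """Given cases tagged with a 'difficulty' tier, report tiers that are
    under-represented. An empty result means the set spans the spectrum."""
    counts = Counter(c["difficulty"] for c in cases)
    return {tier: counts[tier] for tier in TIERS if counts[tier] < min_per_tier}
```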

Physician validation

Every case in a gold-standard set needs physician review — not just for correctness, but for clinical plausibility. Does this case reflect a real clinical scenario? Is the expected reasoning chain defensible? Would a competent clinician approach this case the way the evaluation expects? Cases that fail this check introduce evaluation noise regardless of their factual accuracy.

Multi-dimensional scoring

Clinical evaluation should assess more than accuracy. Safety (does the model flag dangerous recommendations?), uncertainty awareness (does it express appropriate confidence?), reasoning quality (is the reasoning chain clinically valid?), and robustness (does performance hold across demographic and clinical subgroups?) are all dimensions that public benchmarks largely neglect — and that matter for deployment decisions.
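Keeping these dimensions separate in reporting is a structural choice, not just a principle. A minimal sketch, with dimension names assumed for illustration:

```python
from dataclasses import dataclass, fields


@dataclass
class CaseScore:
    accuracy: float               # final answer correct? 0-1
    safety: float                 # dangerous recommendations flagged? 0-1
    uncertainty_awareness: float  # confidence appropriate to the evidence? 0-1
    reasoning_quality: float      # physician-graded chain validity, 0-1
    robustness: float             # stable across demographic/clinical subgroups? 0-1


def dimension_report(scores: list) -> dict:
    """Mean per dimension, reported separately. Collapsing to a single number
    hides exactly the safety signal that deployment decisions depend on."""
    return {f.name: sum(getattr(s, f.name) for s in scores) / len(scores)
            for f in fields(CaseScore)}
```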

06

CONTINUOUS REFRESH AND MAINTENANCE

A static evaluation set has a half-life. As models are trained on newer data, as clinical guidelines evolve, and as contamination vectors multiply, any fixed benchmark degrades in usefulness over time.

The case for living benchmarks

Approaches like LiveClin — which refreshes biannually with contemporary case reports — demonstrate the value of continuously updated evaluation sets. The tradeoff is significant ongoing investment in physician-authored content, but the alternative is evaluation decay: scores that look stable while the benchmark quietly stops measuring what it claims to.

Version control for evaluation data

Evaluation sets should be versioned like software. When cases are added, retired, or updated to reflect new clinical guidelines, the version history must be preserved so that model performance can be compared across consistent snapshots. Without this discipline, longitudinal performance tracking becomes meaningless.
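One lightweight way to get consistent snapshots is a deterministic content hash over the case set, so any two runs can prove they scored against the same version. A minimal sketch, assuming cases are JSON-serializable dicts with a `case_id` field:

```python
import hashlib
import json


def snapshot_version(cases: list) -> str:
    """Deterministic version identifier for an evaluation-set snapshot.
    Sorting by case_id (and sorting dict keys) makes the hash independent of
    storage order, so identical content always yields the same version."""
    canonical = json.dumps(sorted(cases, key=lambda c: c["case_id"]),
                           sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]
```

Adding, retiring, or editing a single case changes the identifier, which forces longitudinal comparisons to name the snapshot they were run against.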

07

WHERE FABRICA FITS

Fabrica's evaluation tools help clinical AI teams build gold-standard benchmarks that test reasoning over retrieval, resist contamination, and reflect real-world diagnostic complexity. Our physician network authors and validates evaluation cases sourced from real clinical encounters — with structured reasoning chains, calibrated difficulty, and multi-dimensional scoring rubrics.

Evaluation is one piece of the clinical AI data pipeline. See our companion guides on clinical data annotation and preference data for clinical model alignment.

REQUEST EARLY ACCESS