EVALUATING CLINICAL REASONING IN EMERGENCY MEDICINE AI
How 3,984 patients produced only 72 physician rationales — and what that reveals about the infrastructure gap between building clinical AI and knowing whether it actually thinks like a doctor.
MEHANDRU ET AL. · UC BERKELEY & UCSF · 2025
THE STUDY
In 2025, researchers at UC Berkeley and UCSF published “ER-Reason: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room,” a benchmark designed to evaluate whether large language models (LLMs) can reason through real emergency department cases the way clinicians do.
The premise addresses a growing gap in medical AI. Most benchmarks evaluate clinical LLMs using multiple-choice questions from licensing exams like the USMLE. Models now score above 90% on these tests. But licensing exams test knowledge recall in curated clinical vignettes with unambiguous answer choices. Real clinical decisions involve synthesizing fragmented, longitudinal records under time pressure — and the reasoning behind the decision matters as much as the decision itself.
The emergency room is a particularly demanding test of this distinction. ER physicians operate under a “worst-first” paradigm: rule-out reasoning that prioritizes excluding life-threatening conditions before pursuing the most likely diagnosis. The consequences of flawed reasoning are immediate. A model that recommends the right treatment but fails to consider a pulmonary embolism in the differential is more dangerous than one that's uncertain and says so.
The dataset
ER-Reason includes data from 3,984 patients spanning 25,174 de-identified longitudinal clinical notes: discharge summaries, progress notes, history and physical exams, consult notes, echocardiography and imaging reports, and ER provider documentation. The dataset covers 395 unique chief complaints, with abdominal pain, shortness of breath, and chest pain the most frequent.
The evaluation tasks
The benchmark defines five tasks aligned with the ER workflow: triage intake (assigning an Emergency Severity Index, or ESI, level on the five-point scale from 1 = critical to 5 = non-urgent), EHR review and patient summarization, initial assessment with differential diagnosis, treatment selection, and final diagnosis with disposition planning. Each task evaluates a different stage of the decision-making process, from sparse initial information to full diagnostic workup.
But the benchmark's most important contribution isn't the tasks. It's what they tried to collect as the ground truth for evaluating clinical reasoning — and how little of it they were able to get.
THE 72 RATIONALES
Of the 3,984 patient encounters in the dataset, 72 received physician-authored rationales — detailed, step-by-step explanations capturing the clinical reasoning behind ER decisions. That's 1.8% of the dataset.
These 72 rationales are the most valuable component of the entire benchmark. They capture what standard ER documentation systematically omits: the reasoning traces behind clinical decisions. Not just “we ordered a CT angiogram” but why — which differential diagnoses were considered, which were ruled out, which medical factors drove the decision to order that specific test rather than another.
What each rationale contains
Each physician rationale covers three dimensions of clinical reasoning: rule-out reasoning — the systematic enumeration and exclusion of plausible diagnoses; identification of relevant medical decision factors — which labs, imaging studies, and clinical signs inform the diagnostic path; and treatment planning — the rationale for specific interventions and their prioritization.
The rationales were designed to mimic the teaching process used in residency training, where attending physicians walk through their reasoning explicitly for educational purposes. This is the kind of clinical thinking that happens in every ER encounter but is almost never written down — because the pace of emergency medicine doesn't allow for it, and documentation standards don't require it.
Why only 72
Collecting these rationales required building a custom application, securing IRB approval, recruiting practicing ER attending and resident physicians, compensating them for their time, and guiding each clinician through a structured workflow for each patient case. The researchers did everything right — structured collection protocol, practicing clinicians, IRB oversight. They simply hit the fundamental constraint: physician time is finite and expensive, and there is no reusable infrastructure for collecting clinical reasoning at scale.
The result is a dataset where 98.2% of patient encounters have no physician-authored reasoning at all. The evaluation signal the field most needs — structured clinical thinking — is the signal that exists for the fewest cases.
WHY BENCHMARKS WITHOUT REASONING ARE DANGEROUS
The ER-Reason results illustrate exactly why accuracy-only evaluation is insufficient for clinical AI.
Right answer, wrong reasoning
The benchmark evaluated four LLMs, including GPT-4o and o3-mini. On the triage task, o3-mini achieved the highest accuracy at 62.7% — but it did so by over-classifying patients as “Urgent” (73.62% predicted vs. 54.83% actual) and failing to identify any Less Urgent or Non-Urgent cases. It effectively compressed the five-level ESI scale into a binary one, defaulting to mid-level acuity for nearly everything.
In a clinical setting, this behavior looks like acceptable accuracy on a benchmark — but it translates to resource misallocation at scale. Truly emergent patients get undertreated. Routine cases get overtreated. The 155 million annual ER visits in the United States make even small classification errors consequential.
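This failure mode is easy to demonstrate numerically. A minimal sketch with invented toy labels (not the paper's data): a predictor that collapses the five-level scale toward “Urgent” still posts a passable overall accuracy, while per-level recall exposes the collapse.

```python
# Invented toy data (not ER-Reason's numbers): 100 (gold, predicted) triage pairs.
pairs = (
    [("Urgent", "Urgent")] * 50          # mid-acuity cases, correctly labeled
    + [("Emergent", "Emergent")] * 10
    + [("Emergent", "Urgent")] * 15      # truly emergent cases under-triaged
    + [("Less Urgent", "Urgent")] * 15   # low-acuity cases over-triaged
    + [("Non-Urgent", "Urgent")] * 10
)

accuracy = sum(gold == pred for gold, pred in pairs) / len(pairs)

# Per-level recall shows what the headline number hides.
recall = {}
for level in ("Emergent", "Urgent", "Less Urgent", "Non-Urgent"):
    gold_n = sum(gold == level for gold, _ in pairs)
    hits = sum(gold == level == pred for gold, pred in pairs)
    recall[level] = hits / gold_n

print(f"accuracy: {accuracy:.2f}")   # 0.60 — looks acceptable
print(recall)                        # Less Urgent and Non-Urgent recall: 0.0
```

A single accuracy number rewards this degenerate strategy; only the per-level breakdown reveals that two of the five acuity classes are never identified at all.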
The final diagnosis illusion
On the final diagnosis task, o3-mini achieved only 34.40% exact-match accuracy on ICD-10 codes. But when evaluated at the broader Hierarchical Condition Category (HCC) level — a CMS system that groups ICD-10 diagnosis codes into broader, clinically meaningful categories — accuracy jumped to approximately 80%. This spread reveals a model that has a rough clinical sense — it knows the right neighborhood — but lacks the precision that actual clinical decisions require.
Without physician reasoning to evaluate how the model reached its conclusions, there's no way to distinguish between a model that narrowed correctly through a differential diagnosis and one that pattern-matched to a plausible-sounding code. The accuracy number looks the same. The clinical safety profile is entirely different.
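The exact-match vs. category-level spread can be reproduced in miniature. In this sketch the cases are hypothetical, and the grouping is a crude three-character-prefix stand-in for the actual CMS HCC crosswalk:

```python
# Hypothetical (gold, predicted) ICD-10 pairs — not cases from the dataset.
cases = [
    ("I21.3", "I21.9"),   # STEMI vs. unspecified MI: wrong code, right neighborhood
    ("J18.9", "J18.9"),   # pneumonia: exact match
    ("K35.80", "R10.9"),  # appendicitis vs. generic abdominal pain: wrong neighborhood
    ("I50.9", "I50.23"),  # heart failure: wrong code, right neighborhood
]

def category(code: str) -> str:
    # Crude stand-in for HCC grouping: the ICD-10 three-character category.
    return code[:3]

exact = sum(g == p for g, p in cases) / len(cases)
coarse = sum(category(g) == category(p) for g, p in cases) / len(cases)
print(f"exact-match: {exact:.2f}, category-level: {coarse:.2f}")  # 0.25 vs. 0.75
```

The same spread appears whenever a model lands in the right clinical neighborhood without the specificity the final code requires.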
The disposition bias
On the disposition task — deciding whether a patient should be discharged, admitted, or transferred — models exhibited a systematic bias toward predicting admission over discharge. This is the safest possible error for the model to make (admitting a patient who could be discharged is less dangerous than the reverse), but it suggests the model has learned to be risk-averse rather than clinically discriminating.
In each case, the failure isn't in the model's knowledge — it's in its reasoning. And the only way to diagnose reasoning failures is to have physician-authored reasoning to compare against. Seventy-two cases is not enough to characterize the reasoning behavior of any model at clinical deployment scale.
WHAT THE ANNOTATION CONSTRAINTS FORCED
The 72-rationale bottleneck didn't just limit the dataset. It shaped the entire evaluation methodology — forcing the team into proxy metrics and automated pipelines that lose the signal they were trying to measure.
Automated concept matching instead of reasoning evaluation
To evaluate the treatment planning task — the core clinical reasoning stage — the team mapped free-text model outputs and physician rationales to Unified Medical Language System (UMLS) Concept Unique Identifiers using the cTAKES clinical NLP toolkit. The evaluation metric was clinical concept recall: what proportion of the clinical concepts in the physician's reasoning also appeared in the model's output?
This is an engineering solution to an infrastructure problem. Concept recall measures whether a model mentions the same medical entities as the physician — the same lab tests, the same diagnoses, the same medications. But it cannot measure whether the model connected those concepts correctly. A model that mentions “troponin,” “chest pain,” and “pulmonary embolism” scores the same concept recall whether it said “troponin was normal, ruling out MI, so consider PE” or “troponin was elevated, confirming PE.” The concepts match. The reasoning is contradictory.
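The limitation can be shown directly. A minimal sketch of concept recall, with a hand-written term-to-identifier dictionary standing in for the cTAKES pipeline and invented example texts (the CUI values are placeholders, not real UMLS identifiers):

```python
# Hand-written stand-in for cTAKES concept mapping; CUIs are placeholders.
CUI = {
    "troponin": "CUI:troponin",
    "chest pain": "CUI:chest_pain",
    "pulmonary embolism": "CUI:pulmonary_embolism",
}

def concepts(text: str) -> set:
    text = text.lower()
    return {cui for term, cui in CUI.items() if term in text}

physician = ("Chest pain with a normal troponin; "
             "pulmonary embolism remains on the differential.")
model_a = ("Troponin was normal, making MI unlikely; "
           "consider pulmonary embolism given the chest pain.")
model_b = ("Troponin was elevated, confirming pulmonary embolism "
           "as the cause of the chest pain.")

ref = concepts(physician)
recall_a = len(concepts(model_a) & ref) / len(ref)
recall_b = len(concepts(model_b) & ref) / len(ref)
print(recall_a, recall_b)  # 1.0 1.0 — identical scores, contradictory reasoning
```

Both outputs mention the same three concepts, so concept recall cannot separate the sound rule-out from the contradictory one.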
No schema iteration
The researchers built a custom application to guide physicians through a structured workflow for rationale collection. But this application was purpose-built for one study. There was no ability to run calibration batches (small pilot annotation rounds) to test whether the schema captured the right dimensions of reasoning, iterate on the schema based on annotator feedback, or refine the collection protocol before scaling. The schema was designed once and deployed once.
No multi-physician validation
The rationales come from individual physicians. There was no measurement of inter-annotator agreement on the reasoning dimension: would two ER physicians produce the same rule-out differential for the same patient? Would they identify the same medical decision factors? Without this signal, there's no way to separate genuine clinical consensus from individual reasoning style — and no way to know whether the “gap” between LLM and physician reasoning reflects a model limitation or a measurement artifact.
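Had multiple physicians annotated the same cases, agreement could be quantified with a standard chance-corrected statistic such as Cohen's kappa. A self-contained sketch with hypothetical labels (the diagnoses and values are invented):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators over paired labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical: each physician's top rule-out diagnosis for the same 8 patients.
doc1 = ["PE", "ACS", "PE", "dissection", "ACS", "PE", "ACS", "PE"]
doc2 = ["PE", "ACS", "ACS", "dissection", "ACS", "PE", "PE", "PE"]

print(round(cohens_kappa(doc1, doc2), 2))  # moderate agreement, well below 1.0
```

Even two experienced clinicians can diverge on a differential; without measuring that baseline, a model-vs-physician gap is uninterpretable.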
Single institution, single department
All data comes from the ER of a single large academic medical center (UCSF). ER practice patterns vary significantly across institution types: academic medical centers see different patient populations, have different resource availability, and follow different protocols than community hospitals or rural EDs. A benchmark that captures reasoning from one institution's physicians cannot tell you whether those reasoning patterns transfer to any other setting.
THE BROADER EVIDENCE
ER-Reason is not an isolated effort. The entire field of clinical AI evaluation is converging on the same conclusion: accuracy-only benchmarks are insufficient, physician reasoning is the missing evaluation signal, and no one has the infrastructure to collect it at scale.
MedR-Bench: reasoning is factual but incomplete
A 2025 study published in Nature Communications introduced MedR-Bench, a benchmark of 1,453 structured clinical cases with reference reasoning derived from published case reports. Their automated Reasoning Evaluator measured three dimensions: efficiency (does each step add new information?), factuality (are steps medically accurate?), and completeness (are critical reasoning steps present?).
The results are revealing. Current LLMs achieve nearly 90% factuality — their reasoning steps are generally medically accurate. But completeness scores are substantially lower: critical reasoning steps are routinely missing. Models get the facts right but skip the logic that connects them. This is exactly the pattern ER-Reason also found, and it's a pattern that accuracy-only benchmarks cannot detect.
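The factuality/completeness split can be made concrete. MedR-Bench's actual Reasoning Evaluator is model-based; the sketch below substitutes toy keyword matching just to show the shape of a completeness score (all reasoning steps are invented):

```python
# Toy stand-in for MedR-Bench's Reasoning Evaluator (the real one is
# model-based); all reasoning steps here are invented for illustration.
reference_steps = [
    "elevated D-dimer raises suspicion for thromboembolism",
    "normal troponin argues against acute coronary syndrome",
    "CT angiogram ordered to confirm pulmonary embolism",
]
model_steps = [
    "D-dimer is elevated, suggesting thromboembolism",
    "CT angiogram will confirm pulmonary embolism",
]

def words(s: str) -> set:
    return set(s.lower().replace(",", "").split())

def covered(ref: str, candidates: list) -> bool:
    # Toy matching: at least three shared keywords counts as coverage.
    return any(len(words(ref) & words(c)) >= 3 for c in candidates)

# Every model step is factually fine, but the troponin rule-out step is absent.
completeness = sum(covered(r, model_steps) for r in reference_steps) / len(reference_steps)
print(f"completeness: {completeness:.2f}")  # one critical step missing
```

Both model steps would score as factual in isolation; only a completeness check against reference reasoning reveals the skipped rule-out.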
LiveClin: contamination makes static benchmarks unreliable
A 2025 study accepted at ICLR 2026 introduced LiveClin, a live clinical benchmark designed to resist data contamination (evaluation data leaking into a model's training set and inflating scores). Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin engaged 239 physicians to transform authentic patient cases into evaluation scenarios spanning the entire clinical pathway. The top-performing model achieved a case accuracy of just 35.7% — compared to 90%+ on contaminated multiple-choice benchmarks.
LiveClin demonstrates that when you remove contamination and test on genuinely novel clinical scenarios, the performance gap is enormous. But even LiveClin's physician-intensive approach faces sustainability questions: coordinating 239 physicians for each biannual update is a logistical feat that few organizations can replicate.
The convergent pattern
ER-Reason, MedR-Bench, LiveClin. Each team independently concluded that physician reasoning is the evaluation signal the field needs. Each team independently hit the same bottleneck: collecting that reasoning is operationally intractable without dedicated infrastructure. The 72 rationales in ER-Reason, the case-report-derived reasoning in MedR-Bench, the 239-physician workflow in LiveClin — these are all heroic workarounds for the same missing piece. The infrastructure for physician-led evaluation doesn't exist yet.
HOW FABRICA CHANGES THIS
The ER-Reason team didn't collect only 72 rationales because they thought that was enough. They collected 72 because building a custom application, recruiting ER physicians, and managing a structured annotation workflow from scratch — for a single study — is the only option available today. Fabrica replaces the custom application with reusable infrastructure.
Reasoning traces as standard output
Fabrica's core annotation output isn't a label — it's a structured reasoning trace. For each clinical decision, annotating physicians record the evidence they considered, the alternatives they weighed, the confidence they assign, and the logic connecting observations to conclusions. This is exactly what ER-Reason's 72 rationales captured: rule-out reasoning, medical decision factors, treatment rationale. The difference is that Fabrica produces this as standard annotation output, not as a special collection effort limited to 1.8% of cases.
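As a rough illustration of what “structured” means here — this is a hypothetical schema sketched for this article, not Fabrica's actual data model — a reasoning trace might look like:

```python
from dataclasses import dataclass

@dataclass
class RuledOutDiagnosis:
    diagnosis: str
    evidence_against: list    # the findings used to exclude it

@dataclass
class ReasoningTrace:
    chief_complaint: str
    considered: list          # full differential, worst-first
    ruled_out: list           # exclusions, each with its evidence
    decision_factors: list    # labs, imaging, clinical signs
    final_assessment: str
    confidence: float         # annotator's stated confidence, 0-1

# Invented example encounter, mirroring the three dimensions the
# ER-Reason rationales capture.
trace = ReasoningTrace(
    chief_complaint="chest pain",
    considered=["pulmonary embolism", "ACS", "aortic dissection", "GERD"],
    ruled_out=[RuledOutDiagnosis("ACS", ["serial troponins negative",
                                         "non-ischemic ECG"])],
    decision_factors=["D-dimer elevated", "tachycardia",
                      "recent long-haul flight"],
    final_assessment="CT angiogram to evaluate for pulmonary embolism",
    confidence=0.7,
)
```

A trace like this is machine-comparable: a model's output can be scored against the differential, the exclusions, and the decision factors separately, rather than against a single final label.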
Physician evaluation workflows, not one-off applications
ER-Reason's custom collection app was built for one study. When the next team needs physician rationales for a different clinical domain — cardiology, oncology, radiology — they build another custom app from scratch. Fabrica provides the workflow infrastructure that generalizes: schema design and iteration, calibration batches before scaling, multi-reader annotation with disagreement tracking, and quality metrics throughout. The investment in infrastructure compounds across studies instead of evaporating after each one.
Process-based evaluation, not just concept overlap
ER-Reason was forced to evaluate reasoning via clinical concept recall — a proxy metric that measures whether models mention the same medical entities as physicians, but not whether they connect them correctly. Fabrica's evaluation datasets are built with multi-dimensional scoring: not just accuracy, but safety (does the model miss dangerous diagnoses?), uncertainty calibration (does the model know what it doesn't know?), reasoning quality (is the logic sound?), and robustness (does performance hold under distribution shift?). This is evaluation infrastructure, not a one-time metric.
Scalable physician networks across institutions
ER-Reason's rationales come from physicians at one institution. LiveClin required coordinating 239 physicians for each update cycle. Fabrica maintains physician annotator networks across institutions, enabling cross-site evaluation that captures the real-world variation in clinical reasoning. Different hospitals, different patient populations, different practice patterns — all reflected in the evaluation data, so models are tested against the diversity they'll face in deployment.
The bottom line
ER-Reason demonstrates that clinical reasoning is the evaluation signal the field needs. Models that score well on accuracy can still reason dangerously — compressing triage scales, defaulting to conservative dispositions, getting to the right answer through the wrong logic. The only way to detect these failure modes is with physician-authored reasoning to compare against. But 72 rationales from 3,984 patients — a 1.8% coverage rate — proves that collecting this signal ad hoc is untenable. The annotation bottleneck for clinical AI isn't just about labels. It's about reasoning. Fabrica builds the infrastructure to capture it.
REQUEST EARLY ACCESS
REFERENCES
Mehandru, N., Golchini, N., Bamman, D., Zack, T., Molina, M.F., & Alaa, A. (2025). ER-Reason: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room. arXiv:2505.22919.
Qiu, P., Wu, C., et al. (2025). Quantifying the reasoning abilities of LLMs on clinical cases. Nature Communications, 16, 9799.
LiveClin (2025). A Live Clinical Benchmark without Leakage. ICLR 2026 conference submission.