EVALUATING CLINICAL REASONING IN EMERGENCY MEDICINE AI
How 3,984 patients produced only 72 physician rationales — and what that reveals about the infrastructure gap between building clinical AI and knowing whether it actually thinks like a doctor.
MEHANDRU ET AL. · UC BERKELEY & UCSF · 2025
THE STUDY
In 2025, researchers at UC Berkeley and UCSF published “ER-Reason: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room,” a benchmark designed to evaluate whether large language models (LLMs) can reason through real emergency department cases the way clinicians do.
The premise addresses a growing gap in medical AI. Most benchmarks evaluate clinical LLMs using multiple-choice questions from licensing exams like the USMLE. Models now score above 90% on these tests. But licensing exams test knowledge recall in curated clinical vignettes with unambiguous answer choices. Real clinical decisions involve synthesizing fragmented, longitudinal records under time pressure — and the reasoning behind the decision matters as much as the decision itself.
The emergency room is a particularly demanding test of this distinction. ER physicians operate under a “worst-first” paradigm: rule-out reasoning that prioritizes excluding life-threatening conditions before pursuing the most likely diagnosis. The consequences of flawed reasoning are immediate. A model that recommends the right treatment but fails to consider a pulmonary embolism in the differential is more dangerous than one that's uncertain and says so.
The dataset
ER-Reason includes data from 3,984 patients spanning 25,174 de-identified longitudinal clinical notes: discharge summaries, progress notes, history and physical exams, consult notes, echocardiography and imaging reports, and ER provider documentation. The dataset covers 395 unique chief complaints, with abdominal pain, shortness of breath, and chest pain the most frequent.
The evaluation tasks
The benchmark defines five tasks aligned with the ER workflow: triage intake (assigning an Emergency Severity Index, or ESI, level on the five-point scale from 1 = critical to 5 = non-urgent), EHR review and patient summarization, initial assessment with differential diagnosis, treatment selection, and final diagnosis with disposition planning. Each task evaluates a different stage of the decision-making process, from sparse initial information to full diagnostic workup.
But the benchmark's most important contribution isn't the tasks. It's what they tried to collect as the ground truth for evaluating clinical reasoning — and how little of it they were able to get.
THE 72 RATIONALES
Of the 3,984 patient encounters in the dataset, 72 received physician-authored rationales — detailed, step-by-step explanations capturing the clinical reasoning behind ER decisions. That's 1.8% of the dataset.
These 72 rationales are the most valuable component of the entire benchmark. They capture what standard ER documentation systematically omits: the reasoning traces behind clinical decisions. Not just “we ordered a CT angiogram” but why — which differential diagnoses were considered, which were ruled out, which medical factors drove the decision to order that specific test rather than another.
What each rationale contains
Each physician rationale covers three dimensions of clinical reasoning: rule-out reasoning — the systematic enumeration and exclusion of plausible diagnoses; identification of relevant medical decision factors — which labs, imaging studies, and clinical signs inform the diagnostic path; and treatment planning — the rationale for specific interventions and their prioritization.
The rationales were designed to mimic the teaching process used in residency training, where attending physicians walk through their reasoning explicitly for educational purposes. This is the kind of clinical thinking that happens in every ER encounter but is almost never written down — because the pace of emergency medicine doesn't allow for it, and documentation standards don't require it.
Why only 72
Collecting these rationales required building a custom application, securing IRB approval, recruiting practicing ER attending and resident physicians, compensating them for their time, and guiding each clinician through a structured workflow for each patient case. The researchers did everything right — structured collection protocol, practicing clinicians, IRB oversight. They simply hit the fundamental constraint: physician time is finite and expensive, and there is no reusable infrastructure for collecting clinical reasoning at scale.
The result is a dataset where 98.2% of patient encounters have no physician-authored reasoning at all. The evaluation signal the field most needs — structured clinical thinking — is the signal that exists for the fewest cases.
WHY BENCHMARKS WITHOUT REASONING ARE DANGEROUS
The ER-Reason results illustrate exactly why accuracy-only evaluation is insufficient for clinical AI.
Right answer, wrong reasoning
The benchmark evaluated four LLMs, including GPT-4o and o3-mini. On the triage task, o3-mini achieved the highest accuracy at 62.7% — but it did so by over-classifying patients as “Urgent” (73.62% predicted vs. 54.83% actual) and failing to identify any Less Urgent or Non-Urgent cases. It effectively compressed the five-level ESI scale into a binary one, defaulting to mid-level acuity for nearly everything.
In a clinical setting, this behavior looks like acceptable accuracy on a benchmark — but it translates to resource misallocation at scale. Truly emergent patients get undertreated. Routine cases get overtreated. The 155 million annual ER visits in the United States make even small classification errors consequential.
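This failure mode is easy to demonstrate numerically. A minimal sketch with invented toy labels (not the paper's data): a predictor that collapses the five-level scale toward “Urgent” still posts a passable overall accuracy, while per-level recall exposes the collapse.

```python
# Invented toy data (not ER-Reason's numbers): 100 (gold, predicted) triage pairs.
pairs = (
    [("Urgent", "Urgent")] * 50          # mid-acuity cases, correctly labeled
    + [("Emergent", "Emergent")] * 10
    + [("Emergent", "Urgent")] * 15      # truly emergent cases under-triaged
    + [("Less Urgent", "Urgent")] * 15   # low-acuity cases over-triaged
    + [("Non-Urgent", "Urgent")] * 10
)

accuracy = sum(gold == pred for gold, pred in pairs) / len(pairs)

# Per-level recall shows what the headline number hides.
recall = {}
for level in ("Emergent", "Urgent", "Less Urgent", "Non-Urgent"):
    gold_n = sum(gold == level for gold, _ in pairs)
    hits = sum(gold == level == pred for gold, pred in pairs)
    recall[level] = hits / gold_n

print(f"accuracy: {accuracy:.2f}")   # 0.60 — looks acceptable
print(recall)                        # Less Urgent and Non-Urgent recall: 0.0
```

A single accuracy number rewards this degenerate strategy; only the per-level breakdown reveals that two of the five acuity classes are never identified at all.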
The final diagnosis illusion
On the final diagnosis task, o3-mini achieved only 34.40% exact-match accuracy on ICD-10 codes. But when evaluated at the broader Hierarchical Condition Category (HCC) level — a CMS system that groups ICD-10 diagnosis codes into broader, clinically meaningful categories — accuracy jumped to approximately 80%. This spread reveals a model that has a rough clinical sense — it knows the right neighborhood — but lacks the precision that actual clinical decisions require.
Without physician reasoning to evaluate how the model reached its conclusions, there's no way to distinguish between a model that narrowed correctly through a differential diagnosis and one that pattern-matched to a plausible-sounding code. The accuracy number looks the same. The clinical safety profile is entirely different.
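The exact-match vs. category-level spread can be reproduced in miniature. In this sketch the cases are hypothetical, and the grouping is a crude three-character-prefix stand-in for the actual CMS HCC crosswalk:

```python
# Hypothetical (gold, predicted) ICD-10 pairs — not cases from the dataset.
cases = [
    ("I21.3", "I21.9"),   # STEMI vs. unspecified MI: wrong code, right neighborhood
    ("J18.9", "J18.9"),   # pneumonia: exact match
    ("K35.80", "R10.9"),  # appendicitis vs. generic abdominal pain: wrong neighborhood
    ("I50.9", "I50.23"),  # heart failure: wrong code, right neighborhood
]

def category(code: str) -> str:
    # Crude stand-in for HCC grouping: the ICD-10 three-character category.
    return code[:3]

exact = sum(g == p for g, p in cases) / len(cases)
coarse = sum(category(g) == category(p) for g, p in cases) / len(cases)
print(f"exact-match: {exact:.2f}, category-level: {coarse:.2f}")  # 0.25 vs. 0.75
```

The same spread appears whenever a model lands in the right clinical neighborhood without the specificity the final code requires.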
The disposition bias
On the disposition task — deciding whether a patient should be discharged, admitted, or transferred — models exhibited a systematic bias toward predicting admission over discharge. This is the safest possible error for the model to make (admitting a patient who could be discharged is less dangerous than the reverse), but it suggests the model has learned to be risk-averse rather than clinically discriminating.
In each case, the failure isn't in the model's knowledge — it's in its reasoning. And the only way to diagnose reasoning failures is to have physician-authored reasoning to compare against. Seventy-two cases is not enough to characterize the reasoning behavior of any model at clinical deployment scale.
WHAT THE ANNOTATION CONSTRAINTS FORCED
The 72-rationale bottleneck didn't just limit the dataset. It shaped the entire evaluation methodology — forcing the team into proxy metrics and automated pipelines that lose the signal they were trying to measure.
Automated concept matching instead of reasoning evaluation
To evaluate the treatment planning task — the core clinical reasoning stage — the team mapped free-text model outputs and physician rationales to Unified Medical Language System (UMLS) Concept Unique Identifiers using the cTAKES clinical NLP toolkit. The evaluation metric was clinical concept recall: what proportion of the clinical concepts in the physician's reasoning also appeared in the model's output?
This is an engineering solution to an infrastructure problem. Concept recall measures whether a model mentions the same medical entities as the physician — the same lab tests, the same diagnoses, the same medications. But it cannot measure whether the model connected those concepts correctly. A model that mentions “troponin,” “chest pain,” and “pulmonary embolism” scores the same concept recall whether it said “troponin was normal, ruling out MI, so consider PE” or “troponin was elevated, confirming PE.” The concepts match. The reasoning is contradictory.
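The limitation can be shown directly. A minimal sketch of concept recall, with a hand-written term-to-identifier dictionary standing in for the cTAKES pipeline and invented example texts (the CUI values are placeholders, not real UMLS identifiers):

```python
# Hand-written stand-in for cTAKES concept mapping; CUIs are placeholders.
CUI = {
    "troponin": "CUI:troponin",
    "chest pain": "CUI:chest_pain",
    "pulmonary embolism": "CUI:pulmonary_embolism",
}

def concepts(text: str) -> set:
    text = text.lower()
    return {cui for term, cui in CUI.items() if term in text}

physician = ("Chest pain with a normal troponin; "
             "pulmonary embolism remains on the differential.")
model_a = ("Troponin was normal, making MI unlikely; "
           "consider pulmonary embolism given the chest pain.")
model_b = ("Troponin was elevated, confirming pulmonary embolism "
           "as the cause of the chest pain.")

ref = concepts(physician)
recall_a = len(concepts(model_a) & ref) / len(ref)
recall_b = len(concepts(model_b) & ref) / len(ref)
print(recall_a, recall_b)  # 1.0 1.0 — identical scores, contradictory reasoning
```

Both outputs mention the same three concepts, so concept recall cannot separate the sound rule-out from the contradictory one.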
No schema iteration
The researchers built a custom application to guide physicians through a structured workflow for rationale collection. But this application was purpose-built for one study. There was no ability to run calibration batches (small pilot annotation rounds) to test whether the schema captured the right dimensions of reasoning, iterate on the schema based on annotator feedback, or refine the collection protocol before scaling. The schema was designed once and deployed once.
No multi-physician validation
The rationales come from individual physicians. There was no measurement of inter-annotator agreement on the reasoning dimension: would two ER physicians produce the same rule-out differential for the same patient? Would they identify the same medical decision factors? Without this signal, there's no way to separate genuine clinical consensus from individual reasoning style — and no way to know whether the “gap” between LLM and physician reasoning reflects a model limitation or a measurement artifact.
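Had multiple physicians annotated the same cases, agreement could be quantified with a standard chance-corrected statistic such as Cohen's kappa. A self-contained sketch with hypothetical labels (the diagnoses and values are invented):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators over paired labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical: each physician's top rule-out diagnosis for the same 8 patients.
doc1 = ["PE", "ACS", "PE", "dissection", "ACS", "PE", "ACS", "PE"]
doc2 = ["PE", "ACS", "ACS", "dissection", "ACS", "PE", "PE", "PE"]

print(round(cohens_kappa(doc1, doc2), 2))  # moderate agreement, well below 1.0
```

Even two experienced clinicians can diverge on a differential; without measuring that baseline, a model-vs-physician gap is uninterpretable.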
Single institution, single department
All data comes from the ER of a single large academic medical center (UCSF). ER practice patterns vary significantly across institution types: academic medical centers see different patient populations, have different resource availability, and follow different protocols than community hospitals or rural EDs. A benchmark that captures reasoning from one institution's physicians cannot tell you whether those reasoning patterns transfer to any other setting.
THE BROADER EVIDENCE
ER-Reason is not an isolated effort. The entire field of clinical AI evaluation is converging on the same conclusion: accuracy-only benchmarks are insufficient, physician reasoning is the missing evaluation signal, and no one has the infrastructure to collect it at scale.
MedR-Bench: reasoning is factual but incomplete
A 2025 study published in Nature Communications introduced MedR-Bench, a benchmark of 1,453 structured clinical cases with reference reasoning derived from published case reports. Their automated Reasoning Evaluator measured three dimensions: efficiency (does each step add new information?), factuality (are steps medically accurate?), and completeness (are critical reasoning steps present?).
The results are revealing. Current LLMs achieve nearly 90% factuality — their reasoning steps are generally medically accurate. But completeness scores are substantially lower: critical reasoning steps are routinely missing. Models get the facts right but skip the logic that connects them. This is exactly the pattern ER-Reason also found, and it's a pattern that accuracy-only benchmarks cannot detect.
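The factuality/completeness split can be made concrete. MedR-Bench's actual Reasoning Evaluator is model-based; the sketch below substitutes toy keyword matching just to show the shape of a completeness score (all reasoning steps are invented):

```python
# Toy stand-in for MedR-Bench's Reasoning Evaluator (the real one is
# model-based); all reasoning steps here are invented for illustration.
reference_steps = [
    "elevated D-dimer raises suspicion for thromboembolism",
    "normal troponin argues against acute coronary syndrome",
    "CT angiogram ordered to confirm pulmonary embolism",
]
model_steps = [
    "D-dimer is elevated, suggesting thromboembolism",
    "CT angiogram will confirm pulmonary embolism",
]

def words(s: str) -> set:
    return set(s.lower().replace(",", "").split())

def covered(ref: str, candidates: list) -> bool:
    # Toy matching: at least three shared keywords counts as coverage.
    return any(len(words(ref) & words(c)) >= 3 for c in candidates)

# Every model step is factually fine, but the troponin rule-out step is absent.
completeness = sum(covered(r, model_steps) for r in reference_steps) / len(reference_steps)
print(f"completeness: {completeness:.2f}")  # one critical step missing
```

Both model steps would score as factual in isolation; only a completeness check against reference reasoning reveals the skipped rule-out.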
LiveClin: contamination makes static benchmarks unreliable
A 2025 study accepted at ICLR 2026 introduced LiveClin, a live clinical benchmark designed to resist data contamination (evaluation data leaking into a model's training set and inflating scores). Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin engaged 239 physicians to transform authentic patient cases into evaluation scenarios spanning the entire clinical pathway. The top-performing model achieved a case accuracy of just 35.7% — compared to 90%+ on contaminated multiple-choice benchmarks.
LiveClin demonstrates that when you remove contamination and test on genuinely novel clinical scenarios, the performance gap is enormous. But even LiveClin's physician-intensive approach faces sustainability questions: coordinating 239 physicians for each biannual update is a logistical feat that few organizations can replicate.
The convergent pattern
ER-Reason, MedR-Bench, LiveClin. Each team independently concluded that physician reasoning is the evaluation signal the field needs. Each team independently hit the same bottleneck: collecting that reasoning is operationally intractable without dedicated infrastructure. The 72 rationales in ER-Reason, the case-report-derived reasoning in MedR-Bench, the 239-physician workflow in LiveClin — these are all heroic workarounds for the same missing piece. The infrastructure for physician-led evaluation doesn't exist yet.
HOW FABRICA CHANGES THIS
The ER-Reason team didn't collect only 72 rationales because they thought that was enough. They collected 72 because building a custom application, recruiting ER physicians, and managing a structured annotation workflow from scratch — for a single study — is the only option available today. Fabrica replaces the custom application with reusable infrastructure.
Reasoning traces as standard output
Fabrica's core annotation output isn't a label — it's a structured reasoning trace. For each clinical decision, annotating physicians record the evidence they considered, the alternatives they weighed, the confidence they assign, and the logic connecting observations to conclusions. This is exactly what ER-Reason's 72 rationales captured: rule-out reasoning, medical decision factors, treatment rationale. The difference is that Fabrica produces this as standard annotation output, not as a special collection effort limited to 1.8% of cases.
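As a rough illustration of what “structured” means here — this is a hypothetical schema sketched for this article, not Fabrica's actual data model — a reasoning trace might look like:

```python
from dataclasses import dataclass

@dataclass
class RuledOutDiagnosis:
    diagnosis: str
    evidence_against: list    # the findings used to exclude it

@dataclass
class ReasoningTrace:
    chief_complaint: str
    considered: list          # full differential, worst-first
    ruled_out: list           # exclusions, each with its evidence
    decision_factors: list    # labs, imaging, clinical signs
    final_assessment: str
    confidence: float         # annotator's stated confidence, 0-1

# Invented example encounter, mirroring the three dimensions the
# ER-Reason rationales capture.
trace = ReasoningTrace(
    chief_complaint="chest pain",
    considered=["pulmonary embolism", "ACS", "aortic dissection", "GERD"],
    ruled_out=[RuledOutDiagnosis("ACS", ["serial troponins negative",
                                         "non-ischemic ECG"])],
    decision_factors=["D-dimer elevated", "tachycardia",
                      "recent long-haul flight"],
    final_assessment="CT angiogram to evaluate for pulmonary embolism",
    confidence=0.7,
)
```

A trace like this is machine-comparable: a model's output can be scored against the differential, the exclusions, and the decision factors separately, rather than against a single final label.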
Physician evaluation workflows, not one-off applications
ER-Reason's custom collection app was built for one study. When the next team needs physician rationales for a different clinical domain — cardiology, oncology, radiology — they build another custom app from scratch. Fabrica provides the workflow infrastructure that generalizes: schema design and iteration, calibration batches before scaling, multi-reader annotation with disagreement tracking, and quality metrics throughout. The investment in infrastructure compounds across studies instead of evaporating after each one.
Process-based evaluation, not just concept overlap
ER-Reason was forced to evaluate reasoning via clinical concept recall — a proxy metric that measures whether models mention the same medical entities as physicians, but not whether they connect them correctly. Fabrica's evaluation datasets are built with multi-dimensional scoring: not just accuracy, but safety (does the model miss dangerous diagnoses?), uncertainty calibration (does the model know what it doesn't know?), reasoning quality (is the logic sound?), and robustness (does performance hold under distribution shift?). This is evaluation infrastructure, not a one-time metric.
Scalable physician networks across institutions
ER-Reason's rationales come from physicians at one institution. LiveClin required coordinating 239 physicians for each update cycle. Fabrica maintains physician annotator networks across institutions, enabling cross-site evaluation that captures the real-world variation in clinical reasoning. Different hospitals, different patient populations, different practice patterns — all reflected in the evaluation data, so models are tested against the diversity they'll face in deployment.
The bottom line
ER-Reason demonstrates that clinical reasoning is the evaluation signal the field needs. Models that score well on accuracy can still reason dangerously — compressing triage scales, defaulting to conservative dispositions, getting to the right answer through the wrong logic. The only way to detect these failure modes is with physician-authored reasoning to compare against. But 72 rationales from 3,984 patients — a 1.8% coverage rate — proves that collecting this signal ad hoc is untenable. The annotation bottleneck for clinical AI isn't just about labels. It's about reasoning. Fabrica builds the infrastructure to capture it.
REQUEST EARLY ACCESS
REFERENCES
Mehandru, N., Golchini, N., Bamman, D., Zack, T., Molina, M.F., & Alaa, A. (2025). ER-Reason: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room. arXiv:2505.22919.
Qiu, P., Wu, C., et al. (2025). Quantifying the reasoning abilities of LLMs on clinical cases. Nature Communications, 16, 9799.
LiveClin (2025). A Live Clinical Benchmark without Leakage. ICLR 2026 conference submission.