GUIDE

ML MODELS FOR CLINICAL AI

What actually works, what doesn't, and why data quality matters more than model architecture in clinical AI.

01

THE MODEL ZOO: WHAT CLINICAL AI TEAMS ARE ACTUALLY USING

A general ML guide would walk you through CNNs, RNNs, transformers, and diffusion models as abstract categories. That framing is not useful if you are building clinical AI. What matters is what teams are actually deploying, what is stuck in research, and what has quietly failed. The landscape has converged more than most people realize.

Language models for clinical text

The dominant paradigm is foundation models adapted for clinical use. On the proprietary side, GPT-4 and Med-PaLM 2 demonstrated that general-purpose LLMs could score at or above physician level on medical board examinations. On the open-source side, models like BioMedLM (trained on PubMed abstracts), PMC-LLaMA (continued pretraining of LLaMA on biomedical literature), and Meditron (continued pretraining on a curated medical corpus that includes clinical guidelines) represent attempts to bring domain specificity without API dependency.

For clinical NLP specifically, BERT-family models remain surprisingly relevant. PubMedBERT, ClinicalBERT, and GatorTron (trained on over 90 billion words of clinical text from the University of Florida Health system) still outperform much larger generalist models on tasks like named entity recognition, relation extraction, and clinical note classification — tasks where you need precision on domain-specific structure, not broad generative ability.

Vision models for medical imaging

Medical imaging has its own model lineage. RetFound, a self-supervised vision transformer trained on 1.6 million retinal images, demonstrated that foundation models could work for medical vision — not just NLP. BiomedCLIP aligns medical images with text descriptions, enabling zero-shot classification. For segmentation, the Segment Anything Model (SAM) has been adapted to medical imaging with MedSAM, though its performance on fine-grained clinical structures (e.g., tumor margins) still lags behind purpose-built architectures like nnU-Net.

Multimodal models

The frontier is multimodal: models that combine imaging, text, labs, and genomics. Med-Gemini processes interleaved medical images and text. LLaVA-Med fine-tunes a vision-language model on biomedical image-text pairs. These are research-stage — no team is deploying them in production at scale — but they represent where the field is heading.

The thesis

Across all of these categories, the pattern is the same: the architecture is rarely the differentiator. Most teams working on clinical AI use some variant of a transformer — whether it is a BERT-family encoder, a GPT-family decoder, or a vision transformer. What separates the models that work from the ones that do not is the quality, domain specificity, and structure of the training data. The rest of this guide is about why.

02

FOUNDATION MODELS MEET CLINICAL REALITY

GPT-4 scores roughly 86% on MedQA, a benchmark built from USMLE-style questions, and Med-PaLM 2 reaches expert-level performance on the same benchmark. These headlines created an expectation that foundation models are close to clinical deployment. The reality is more nuanced — and the gap between benchmark performance and clinical utility is where most projects stall.

The benchmark-deployment gap

Medical board exams are multiple-choice questions with a single correct answer, drawn from textbook knowledge. Real clinical practice involves incomplete information, competing plausible diagnoses, time pressure, and consequential tradeoffs. A model that aces Step 1 can still fail at determining whether a specific patient's chest pain is cardiac or musculoskeletal when the history is ambiguous and the ECG is borderline.

The disconnect is structural: MCQ benchmarks test knowledge retrieval, not multi-step clinical inference. Our evaluation guide covers this problem in depth.

Hallucination in safety-critical contexts

Hallucination is an inconvenience in a chatbot and a liability in clinical AI. Foundation models confidently generate drug interactions that do not exist, cite fabricated clinical trials, and recommend dosages that are off by orders of magnitude. Studies have found that even state-of-the-art models hallucinate in 5–15% of medical responses — a rate that is disqualifying for any application with patient-facing consequences.

The problem is compounded by calibration: models do not reliably distinguish between responses they are confident about and ones they are guessing on. A hallucinated drug interaction carries the same confident tone as a well-established one.

Prompting vs. fine-tuning vs. training from scratch

There is a spectrum of effort and control. Prompting a general-purpose model (GPT-4 with a clinical system prompt) is the lowest-effort approach but offers the least control over behavior, no guarantees about data handling, and inherits the model's full hallucination surface. Fine-tuning a foundation model on domain-specific data (LoRA on LLaMA with clinical notes) gives more control and can be done on self-hosted infrastructure. Training a domain-specific model from scratch is the most expensive but gives full control over data provenance, behavior, and compliance.

Compliance as architecture constraint

For any application involving protected health information, the question “which model should we use?” is often answered by “which models can we legally send this data to?” This rules out most API-based foundation models for production workloads on clinical data, pushing teams toward self-hosted open-source models or on-premises deployments. HIPAA compliance is not a feature request — it is an architectural constraint that narrows the model space before performance is even considered.

03

CLINICAL NLP IS ITS OWN DISCIPLINE

Clinical NLP is not just NLP applied to medicine. Clinical text has properties that fundamentally break assumptions baked into general-purpose language models — and ignoring them produces models that look competent on benchmarks but fail in practice.

Negation and assertion

“No evidence of malignancy” contains the word “malignancy” but asserts its absence. Clinical text is saturated with negation, hedging, and conditional statements: “denies chest pain,” “unlikely to represent metastatic disease,” “cannot rule out PE.” A model that treats token presence as semantic presence will extract the opposite of what the note says. This is not an edge case — studies have found that up to 50% of clinical concepts in discharge summaries are negated.
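The standard mitigation is rule-based assertion detection in the style of NegEx, layered under or alongside model predictions. A minimal sketch, assuming a toy trigger list — real systems (NegEx, ConText) use far richer trigger vocabularies and proper scope termination:

```python
import re

# Illustrative trigger list only; production rule sets contain
# hundreds of triggers plus pseudo-negations and scope terminators.
NEGATION_TRIGGERS = [
    r"\bno evidence of\b",
    r"\bdenies\b",
    r"\bnegative for\b",
    r"\bwithout\b",
    r"\bno\b",
]

def is_negated(sentence, concept):
    """Crude assertion check: is the concept preceded by a negation
    trigger within the same sentence?"""
    s = sentence.lower()
    idx = s.find(concept.lower())
    if idx == -1:
        return False
    prefix = s[:idx]
    return any(re.search(trigger, prefix) for trigger in NEGATION_TRIGGERS)
```

A pipeline that extracts "malignancy" without running a check like this will assert the opposite of what the note says.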

Abbreviation ambiguity

“MS” means multiple sclerosis in neurology, mitral stenosis in cardiology, mental status in psychiatry, and morphine sulfate in pharmacy. “PT” is either physical therapy, prothrombin time, or patient depending on context. A study catalogued over 174,000 abbreviation expansions used in clinical text. General-purpose models trained on internet text have no reliable mechanism for disambiguating these — the correct expansion depends on the clinical context (department, note type, surrounding findings) in ways that require domain-specific training data.
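Note metadata carries much of the disambiguation signal. A toy sketch, assuming a hand-built sense inventory keyed by department — the inventory, department names, and lookup strategy are illustrative, not a production disambiguation method:

```python
# Toy sense inventory; real meta-inventories catalogue over 100,000
# abbreviations. Departments and senses here are illustrative.
SENSE_INVENTORY = {
    "MS": {
        "neurology": "multiple sclerosis",
        "cardiology": "mitral stenosis",
        "psychiatry": "mental status",
        "pharmacy": "morphine sulfate",
    },
    "PT": {
        "rehabilitation": "physical therapy",
        "hematology": "prothrombin time",
    },
}

def expand(abbrev, department):
    """Resolve an abbreviation using note metadata as context.
    Returns None when the (abbreviation, department) pair is unknown."""
    return SENSE_INVENTORY.get(abbrev, {}).get(department)
```

Real disambiguators combine this metadata with surrounding text, but the principle holds: the correct expansion is a function of context the model must be trained to use.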

Copy-forward artifacts

EHR systems encourage copy-forward: clinicians paste text from a previous note and modify it. This creates notes where the same paragraph appears across multiple encounters, sometimes with subtle edits and sometimes without any changes at all. A model trained on these notes cannot distinguish between a finding that is actively present and one that was pasted from three weeks ago. Copy-forward also inflates apparent data volume — you may have 10,000 notes but only 3,000 unique clinical observations.
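A cheap first defense is to measure how much unique content a corpus actually contains before trusting its volume. A sketch using exact paragraph hashing — the normalization here is an assumption, and production pipelines add fuzzy matching to catch the subtly edited copies:

```python
import hashlib

def unique_paragraphs(notes):
    """Estimate distinct clinical content by hashing normalized
    paragraphs. Copy-forward makes the same paragraph recur across
    encounters, so raw note counts overstate unique observations."""
    seen = set()
    for note in notes:
        for para in note.split("\n\n"):
            normalized = " ".join(para.lower().split())
            if normalized:
                seen.add(hashlib.sha256(normalized.encode()).hexdigest())
    return len(seen)
```

Running this before training turns "we have 10,000 notes" into an honest estimate of unique signal.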

Temporal reasoning

Clinical reasoning is inherently temporal. A potassium of 5.8 means something different if it is trending up from 4.5 versus down from 7.2. The relationship between a symptom onset, a medication change, and a lab result requires understanding time — and clinical notes encode time in inconsistent, often implicit ways. “Since starting lisinopril” requires the model to know when lisinopril was started, which may be documented in a different note entirely.
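Even two-point trend context changes the meaning of a value. A minimal sketch of the potassium example — the reference range and the two-point trend logic are illustrative simplifications of real trending systems, which use institution-specific ranges and full time series:

```python
def interpret_potassium(current, previous):
    """Attach trend context to a potassium value (mmol/L).
    Reference range ~3.5-5.0 is an illustrative assumption."""
    if current > 5.0:
        level = "high"
    elif current < 3.5:
        level = "low"
    else:
        level = "normal"
    if previous is None:
        trend = "no prior value"
    elif current > previous:
        trend = "rising"
    elif current < previous:
        trend = "falling"
    else:
        trend = "stable"
    return level, trend
```

The same 5.8 produces ("high", "rising") against a prior of 4.5 and ("high", "falling") against a prior of 7.2 — two clinically different situations a model trained on isolated values cannot distinguish.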

Tokenization challenges

General-purpose tokenizers were trained on internet text. Clinical abbreviations like “q4h” (every 4 hours), “PRN” (as needed), and “bid” (twice daily) may be split into meaningless subword tokens. Drug names, gene symbols, and ICD codes are similarly under-represented. This is not fatal — but it means that models using general tokenizers require more clinical training data to learn representations that a domain-specific tokenizer would capture natively. Some teams address this with continued pretraining on clinical corpora, which helps the model build better internal representations even with a suboptimal tokenizer.

04

MEDICAL VISION: WHAT'S SOLVED AND WHAT ISN'T

Computer vision in medicine is not one problem. It is dozens of problems at different stages of maturity, with different data constraints and different relationships between model architecture and annotation quality.

Radiology: relatively mature

Chest X-ray interpretation has the most FDA-cleared AI devices of any imaging modality. Models like CheXNet (2017) and its successors demonstrated human-level performance on specific findings (pneumothorax detection, cardiomegaly classification). The data infrastructure is mature: PACS systems provide standardized DICOM images, and datasets like CheXpert (224,316 chest radiographs) and MIMIC-CXR (377,110 images with free-text reports) provide large-scale training data. The remaining challenges are in generalization across scanner types, patient populations, and the long tail of rare findings.

Pathology: emerging but complex

Digital pathology is earlier in its AI journey, largely because the data is harder. A single whole-slide image can be 40,000 × 40,000 pixels — gigapixel scale. You cannot feed this into a standard CNN. The dominant approach is multiple-instance learning: tile the slide into patches, encode each patch independently, then aggregate patch-level features for a slide-level prediction. Foundation models like CONCH and UNI (trained on hundreds of thousands of histopathology slides) are changing this by providing general-purpose patch encoders, but the aggregation strategy and the quality of pathologist annotations at the slide level remain the performance ceiling.
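The tile-encode-aggregate pattern can be sketched in a few lines. Here `encode` stands in for any pretrained patch encoder (a CONCH- or UNI-style model in practice), and mean/max pooling are the simplest of many aggregation strategies — attention-based pooling is common in production systems:

```python
def mil_predict(slide_patches, encode, aggregate="mean"):
    """Multiple-instance learning skeleton for gigapixel slides:
    encode each patch independently, then pool patch features into
    one slide-level representation."""
    features = [encode(patch) for patch in slide_patches]
    dim = len(features[0])
    if aggregate == "mean":
        return [sum(f[i] for f in features) / len(features) for i in range(dim)]
    if aggregate == "max":
        return [max(f[i] for f in features) for i in range(dim)]
    raise ValueError(f"unknown aggregation: {aggregate}")
```

The slide-level prediction head then consumes the pooled vector — which is why the aggregation strategy and the quality of slide-level labels, not patch encoding, tend to set the ceiling.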

The annotation ceiling

Across all imaging modalities, there is a consistent finding: model performance plateaus when label quality plateaus, not when model capacity is exhausted. A study on diabetic retinopathy grading found that upgrading annotations from general-ophthalmologist to retina-specialist quality improved model AUC by more than switching from ResNet-50 to a model with 4x the parameters. The same pattern appears in dermatology, where diagnostic accuracy correlates more strongly with annotator expertise than with model architecture.

Domain adaptation is non-trivial

Domain adaptation from natural images to medical images is not a simple transfer. ImageNet features (textures, edges, object shapes) transfer poorly to histopathology slides (cellular morphology, staining patterns, tissue architecture) or to CT scans (3D volumetric data, Hounsfield units, variable slice thickness). Models pretrained on natural images require substantial fine-tuning to be useful in clinical imaging — and that fine-tuning is bottlenecked by the quality of clinical annotations, not compute.

05

THE FINE-TUNING DECISION TREE

Choosing how to adapt a model to your clinical task is not a purely technical decision. It depends on data volume, task specificity, compliance constraints, and budget — and the right answer is different for every team.

Full fine-tuning

Update all model parameters on your clinical dataset. This gives maximum adaptation but requires the most compute, the most labeled data (typically tens of thousands of examples), and risks catastrophic forgetting — where the model loses general capabilities while overfitting to the fine-tuning distribution. Best suited for teams with large proprietary clinical datasets and dedicated ML infrastructure.

LoRA and QLoRA

LoRA freezes the base model and trains small low-rank adapter matrices — typically less than 1% of the original parameters. QLoRA adds 4-bit quantization of the base model, reducing memory requirements further. This is the most common approach for clinical AI teams today: it works with hundreds to low thousands of examples, runs on a single GPU, and preserves the base model's general capabilities. The tradeoff is reduced expressiveness on tasks that require deep domain adaptation.
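The parameter savings are easy to verify. For a LoRA update W' = W + BA, with A of shape (rank, d_in) and B of shape (d_out, rank), the trainable adapter holds rank * (d_in + d_out) parameters versus d_in * d_out for the full matrix:

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare a full weight update with a LoRA adapter update.
    Returns (full parameters, adapter parameters, ratio)."""
    full = d_in * d_out
    adapter = rank * d_in + d_out * rank
    return full, adapter, adapter / full
```

For a 4096-dimensional layer at rank 8, the adapter is under 0.4% of the full weight matrix, consistent with the "less than 1%" figure above.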

Continued pretraining

Continued pretraining extends the base model's training on domain-specific corpora (PubMed abstracts, clinical notes, biomedical textbooks) before task-specific fine-tuning. This reshapes the model's internal representations to better capture clinical language. PMC-LLaMA and Meditron both used this approach. It is more expensive than LoRA but produces models with fundamentally better clinical language understanding. Worth the investment when your downstream tasks span multiple clinical applications rather than a single narrow task.

Prompt tuning and in-context learning

Soft prompt tuning trains a small set of continuous vectors prepended to the input, leaving the entire model frozen. This requires the least labeled data and compute, but also provides the least adaptation. In-context learning (providing examples in the prompt) requires no training at all but is limited by context window size and produces inconsistent results. Both approaches are useful for prototyping and proof-of-concept work, less so for production systems.

The decision framework

Ask three questions. How much labeled clinical data do you have? Under 500 examples: prompt tuning or in-context learning. 500–5,000: LoRA. 5,000+: full fine-tuning. Does patient data touch the model? If yes, self-hosted only — this eliminates API-based models and most cloud fine-tuning services. How many downstream tasks? Single task: LoRA directly. Multiple tasks across clinical domains: continued pretraining first, then task-specific LoRA adapters on top.
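The three questions above can be encoded directly. The thresholds mirror this section and should be treated as rules of thumb, not hard boundaries:

```python
def adaptation_strategy(n_labeled, phi_touches_model, n_tasks):
    """Map (data volume, compliance constraint, task count) to an
    adaptation approach, per the framework in this section."""
    hosting = "self-hosted only" if phi_touches_model else "API or self-hosted"
    if n_labeled < 500:
        method = "prompt tuning or in-context learning"
    elif n_labeled < 5000:
        method = "LoRA"
    else:
        method = "full fine-tuning"
    # Multiple downstream tasks favor continued pretraining first,
    # with task-specific LoRA adapters on top.
    if n_tasks > 1 and method != "prompt tuning or in-context learning":
        method = "continued pretraining, then task-specific LoRA adapters"
    return method, hosting
```

Usage: a team with 2,000 physician-labeled examples and PHI in the pipeline lands on LoRA, self-hosted.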

06

WHY DATA QUALITY BEATS MODEL SCALE

The dominant narrative in general ML is that performance scales with model size and data volume. In clinical AI, this relationship breaks down — and understanding why is critical to making good investment decisions about where to spend your budget.

Scaling laws do not hold with noisy labels

Neural scaling laws predict smooth performance improvement as model size and dataset size increase. But these laws assume clean labels. In clinical datasets, label noise is pervasive: crowdsourced annotations from non-specialists, silver labels from NLP pipelines, ICD codes used as proxies for diagnoses. When label noise exceeds a threshold, adding more data with the same noise level does not improve performance — it teaches the model to reproduce the noise with higher confidence. The model gets better at being wrong.
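The ceiling is easy to quantify in the simplest case: under symmetric binary label noise at rate ρ, measured accuracy against the noisy labels is bounded at 1 - ρ even for a perfect model, and no amount of additional equally noisy data raises it. A sketch of the arithmetic:

```python
def noisy_label_ceiling(true_accuracy, noise_rate):
    """Expected measured accuracy against labels corrupted by
    symmetric binary noise: correct predictions agree with the noisy
    label only when the label was not flipped, and vice versa."""
    return true_accuracy * (1 - noise_rate) + (1 - true_accuracy) * noise_rate
```

With 15% label noise, a perfect model measures 85%, and a 90%-accurate model measures 78% — the benchmark itself becomes the bottleneck.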

The quality multiplier

Multiple studies have demonstrated that smaller datasets with expert annotations outperform larger datasets with noisy labels. In one clinical text classification study, 2,000 physician-annotated examples matched the performance of 15,000 crowdsourced examples — a 7.5x data efficiency gain from annotation quality alone. In medical imaging, upgrading annotator expertise from general practitioner to subspecialist consistently yielded larger accuracy improvements than doubling the training set.

Reasoning traces as training signal

The most significant quality dimension is not just label correctness but label richness. Reasoning traces — structured records of the clinical logic behind a label — transform training data from “what to predict” into “how to reason.” Models trained on reasoning-annotated data show improved performance even on tasks that were not explicitly part of the training set, because the reasoning transfers. Our clinical data annotation guide covers this in depth.

The compounding cost of bad data

Bad labels do not just reduce model accuracy. They compound through the pipeline: noisy labels produce a noisy model, which generates noisy synthetic data or pseudo-labels, which are used to train the next model iteration. Each cycle amplifies the original error. In clinical AI, where models may eventually inform treatment decisions, this compounding effect is not just a performance problem — it is a safety problem. The most efficient investment for most clinical AI teams is not a bigger model or more compute. It is better training data.

07

TRANSFER LEARNING AND DOMAIN SHIFT

A model trained at one institution, on one patient population, using one EHR system does not automatically work at another. This is the domain shift problem, and it is arguably the most underestimated challenge in clinical AI deployment.

Institutional variation

Hospitals differ in documentation practices, coding conventions, formulary choices, and clinical workflows. A sepsis prediction model trained on Epic EHR data at an academic medical center may rely on features (specific nursing assessments, particular lab ordering patterns) that do not exist or are structured differently in a Cerner-based community hospital. These are not edge cases — they are the norm. Multi-site clinical AI studies consistently show 5–15% performance degradation when a model is deployed outside its training institution without adaptation.

Population bias

Clinical datasets reflect the demographics of the institution that collected them. A dermatology model trained primarily on lighter skin tones performs significantly worse on darker skin tones — a well-documented and ethically consequential failure mode. Similarly, models trained on adult populations may not generalize to pediatric patients, and models trained in the U.S. healthcare system may embed assumptions about treatment protocols that do not hold internationally. The fix is not algorithmic — it is data diversity. You need training data that represents the population you intend to serve.

What transfers and what does not

General medical knowledge (anatomy, physiology, pharmacology) transfers well across contexts. Clinical reasoning patterns transfer moderately — a model that learned to reason through differential diagnosis at one institution can often apply that reasoning framework elsewhere. What does not transfer: institution-specific documentation patterns, local treatment protocols, population-specific disease prevalence, and any feature derived from a specific EHR system's data model.

Practical mitigation

The most effective mitigation is multi-site training data: models trained on data from multiple institutions generalize better than single-site models, even when total data volume is the same. Federated learning enables this without pooling raw data — each institution trains locally and only model updates are shared. For teams that cannot access multi-site data, the minimum viable approach is prospective validation at the deployment site before go-live: measure performance on local data, identify where the model fails, and fine-tune on site-specific examples to close the gap.
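The core of federated averaging (FedAvg) fits in a few lines: each site trains locally and shares only its weights, and the server computes a sample-size-weighted average. A sketch over flat weight vectors — real systems add secure aggregation, differential privacy, and many rounds:

```python
def fed_avg(site_weights, site_sizes):
    """One aggregation round: weight each site's model update by its
    local sample count. Raw patient data never leaves the site."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(dim)
    ]
```

The averaged weights are then broadcast back to the sites for the next local training round.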

08

THE REGULATORY LANDSCAPE FOR CLINICAL AI MODELS

Regulatory requirements are not an afterthought — they are engineering constraints that should shape model architecture, training pipeline, and data strategy decisions from the start. Retrofitting compliance is orders of magnitude more expensive than designing for it.

FDA Software as a Medical Device

In the U.S., clinical AI is regulated as Software as a Medical Device (SaMD). Classification depends on the seriousness of the condition the software addresses and its role in clinical decisions. A model that flags potential pneumothorax for radiologist review (decision support) faces lighter requirements than one that autonomously triages critical findings (decision making). As of 2025, the FDA has authorized over 950 AI/ML-enabled medical devices — the vast majority for radiology.

Locked vs. adaptive algorithms

A “locked” algorithm produces the same output for the same input every time — it does not learn or update after deployment. An “adaptive” algorithm continues to learn from new data. The FDA's predetermined change control plan (PCCP) framework, finalized in 2024, provides a pathway for adaptive algorithms, but it requires predefining the types of changes the algorithm can make, the retraining protocol, and the performance monitoring plan. This has direct implications for model architecture: if you want an adaptive clinical AI, you need to design the update mechanism and validation pipeline from day one.

EU AI Act

The EU AI Act classifies most clinical AI as “high risk,” requiring conformity assessments, technical documentation, data governance, human oversight, and post-market monitoring. The requirements on training data quality are particularly relevant: the Act mandates that training datasets be “relevant, sufficiently representative, and to the best extent possible, free of errors and complete.” This is not a vague aspiration — it is a legal requirement with enforcement mechanisms.

What this means for engineering

Three practical implications. First, data provenance: you need a complete audit trail for every training example — where it came from, how it was labeled, who labeled it, what quality checks were applied. Second, reproducibility: you need to be able to reproduce any version of your model from its training data and code. Third, monitoring: you need ongoing measurement of model performance in production, with defined thresholds for when performance degradation triggers revalidation. These requirements are easier to meet when they are designed into the system from the start rather than bolted on before a regulatory submission.
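A provenance record can start as simply as a content hash plus labeling metadata. A minimal sketch — the field names are illustrative, not a regulatory schema:

```python
import hashlib
from datetime import datetime, timezone

def provenance_record(example_text, label, annotator_id, source, checks):
    """Audit-trail entry for one training example: what it was
    (content hash), how it was labeled, by whom, where it came from,
    and which quality checks were applied."""
    return {
        "content_sha256": hashlib.sha256(example_text.encode()).hexdigest(),
        "label": label,
        "annotator_id": annotator_id,
        "source": source,
        "quality_checks": checks,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

Hashing the content rather than storing it lets the audit trail live outside the PHI boundary while still proving exactly which example was used.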

09

WHERE THE FIELD IS HEADING

Predicting the future of clinical AI is risky — but certain trends have enough momentum and institutional investment behind them that they are worth preparing for.

Agentic clinical AI

The field is shifting from models that classify to agentic systems that take multi-step actions — ordering diagnostic tests, drafting treatment plans, scheduling follow-ups. Google's AMIE (Articulate Medical Intelligence Explorer) demonstrated a conversational diagnostic agent that outperformed primary care physicians in diagnostic accuracy in controlled settings. The data requirements for agentic AI are qualitatively different: you need training data that captures clinical workflows, not just clinical knowledge. This means annotating sequences of decisions, not isolated classifications.

Retrieval-augmented generation

RAG grounds model outputs in retrieved evidence — clinical guidelines, drug databases, institutional protocols. This directly addresses the hallucination problem by anchoring responses to verifiable sources. The challenge is building and maintaining the knowledge base: clinical guidelines change, drug formularies update, and institutional protocols vary. RAG shifts the data quality problem from training data to retrieval corpus curation — different work, but equally dependent on expert involvement.
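A sketch of the retrieval step, using keyword overlap as a stand-in for the embedding-based retrievers production systems use — the corpus and scoring are illustrative:

```python
def retrieve(query, corpus, k=2):
    """Return the k passages sharing the most terms with the query.
    Retrieved passages are then injected into the model's prompt so
    generated answers can be grounded in (and cited against) them."""
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda passage: len(q_terms & set(passage.lower().split())),
        reverse=True,
    )
    return scored[:k]
```

The retrieval machinery is the easy part; keeping the corpus current as guidelines and formularies change is the ongoing expert-curation work.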

Federated learning at scale

Federated learning enables training across institutions without pooling data. Early results are promising: federated models trained across 20+ hospitals have matched or exceeded models trained on any single site's data, while maintaining strict data separation. The infrastructure challenges (network bandwidth, heterogeneous compute, institutional coordination) are being addressed by platforms like NVIDIA FLARE and Substra, the framework used by the MELLODDY consortium. As these platforms mature, multi-site clinical AI training will become the default rather than the exception.

Synthetic clinical data

Generating synthetic patient records to augment limited real data is an active area of research. Current approaches use fine-tuned LLMs to generate synthetic clinical notes or GANs to generate synthetic medical images. The promise is addressing data scarcity (rare diseases, small populations) and privacy (no real patients in the training set). The risk is subtle: synthetic data inherits and can amplify biases from the generation model, and models trained on synthetic data may learn artifacts of the generation process rather than real clinical patterns. Synthetic data is a complement to real expert-annotated data, not a replacement for it.

Multimodal foundation models

The next generation of clinical AI will process imaging, text, labs, and genomics jointly — mirroring how clinicians actually synthesize information. Early models like Med-Gemini and BiomedGPT demonstrate the feasibility. The data challenge is alignment: linking a chest X-ray to the radiology report to the lab results to the discharge summary for the same patient encounter, with consistent annotations across modalities. This is the hardest annotation problem in clinical AI, and it is the one that will define the next wave of model capabilities.

10

WHERE FABRICA FITS

Every section of this guide returns to the same point: the binding constraint on clinical AI is not model architecture — it is the quality, structure, and domain specificity of training data. Fabrica exists to solve that constraint.

Our physician network provides the expert annotation that clinical AI models need: structured reasoning traces for training data, gold-standard evaluation sets for benchmarking, and preference data for alignment. Whether you are fine-tuning a foundation model, building domain-specific evaluation benchmarks, or running RLHF pipelines, the quality of the human signal is what determines the quality of the model.

See our companion guides for deeper treatment of each stage of the clinical AI data pipeline: clinical data annotation, building gold-standard evaluation sets, and clinical model alignment.

REQUEST EARLY ACCESS