ML MODELS FOR CLINICAL AI
What actually works, what doesn't, and why data quality matters more than model architecture in clinical AI.
THE MODEL ZOO: WHAT CLINICAL AI TEAMS ARE ACTUALLY USING
A general ML guide would walk you through CNNs, RNNs, transformers, and diffusion models as abstract categories. That framing is not useful if you are building clinical AI. What matters is what teams are actually deploying, what is stuck in research, and what has quietly failed. The landscape has converged more than most people realize.
Language models for clinical text
The dominant paradigm is the foundation model: a large model pre-trained on broad data, then adapted to downstream tasks through fine-tuning or prompting (the "foundation" on which task-specific capabilities are built). On the proprietary side, GPT-4 and Med-PaLM 2 demonstrated that general-purpose LLMs (large language models) could score at or above physician level on medical board examinations. On the open-source side, models like BioMedLM (trained on PubMed abstracts), PMC-LLaMA (continued pretraining of LLaMA on biomedical literature), and Meditron (fine-tuned on curated clinical guidelines) represent attempts to bring domain specificity without API dependency.
For clinical NLP specifically, BERT-family models remain surprisingly relevant. PubMedBERT, ClinicalBERT, and GatorTron (trained on over 90 billion words of clinical text from the University of Florida Health system) still outperform much larger generalist models on tasks like named entity recognition, relation extraction, and clinical note classification — tasks where you need precision on domain-specific structure, not broad generative ability.
Vision models for medical imaging
Medical imaging has its own model lineage. RetFound, a self-supervised vision transformer trained on 1.6 million retinal images, demonstrated that foundation models could work for medical vision — not just NLP. BiomedCLIP aligns medical images with text descriptions, enabling zero-shot classification. For segmentation, the Segment Anything Model (SAM) has been adapted to medical imaging with MedSAM, though its performance on fine-grained clinical structures (e.g., tumor margins) still lags behind purpose-built architectures like nnU-Net.
Multimodal models
The frontier is multimodal: models that combine imaging, text, labs, and genomics. Med-Gemini processes interleaved medical images and text. LLaVA-Med fine-tunes a vision-language model on biomedical image-text pairs. These are research-stage — no team is deploying them in production at scale — but they represent where the field is heading.
The thesis
Across all of these categories, the pattern is the same: the architecture is rarely the differentiator. Most teams working on clinical AI use some variant of a transformer — whether it is a BERT-family encoder, a GPT-family decoder, or a vision transformer. What separates the models that work from the ones that do not is the quality, domain specificity, and structure of the training data. The rest of this guide is about why.
FOUNDATION MODELS MEET CLINICAL REALITY
GPT-4 scores 86% on USMLE Step 1. Med-PaLM 2 reaches expert level on MedQA. These headlines created an expectation that foundation models are close to clinical deployment. The reality is more nuanced — and the gap between benchmark performance and clinical utility is where most projects stall.
The benchmark-deployment gap
Medical board exams are multiple-choice questions with a single correct answer, drawn from textbook knowledge. Real clinical practice involves incomplete information, competing plausible diagnoses, time pressure, and consequential tradeoffs. A model that aces Step 1 can still fail at determining whether a specific patient's chest pain is cardiac or musculoskeletal when the history is ambiguous and the ECG is borderline.
The disconnect is structural: MCQ benchmarks test knowledge retrieval, not multi-step clinical inference (reasoning that chains observations, hypotheses, and new evidence to narrow the differential). Our evaluation guide covers this problem in depth.
Hallucination in safety-critical contexts
Hallucination (a model generating plausible-sounding but factually incorrect information, delivered with full confidence) is an inconvenience in a chatbot and a liability in clinical AI. Foundation models confidently generate drug interactions that do not exist, cite fabricated clinical trials, and recommend dosages that are off by orders of magnitude. Studies have found that even state-of-the-art models hallucinate in 5–15% of medical responses — a rate that is disqualifying for any application with patient-facing consequences.
The problem is compounded by calibration: models do not reliably distinguish between responses they are confident about and ones they are guessing on. A hallucinated drug interaction carries the same confident tone as a well-established one.
Prompting vs. fine-tuning vs. training from scratch
There is a spectrum of effort and control. Prompting a general-purpose model (GPT-4 with a clinical system prompt) is the lowest-effort approach but offers the least control over behavior, no guarantees about data handling, and inherits the model's full hallucination surface. Fine-tuning a foundation model on domain-specific data (LoRA on LLaMA with clinical notes) gives more control and can be done on self-hosted infrastructure. Training a domain-specific model from scratch is the most expensive but gives full control over data provenance, behavior, and compliance.
Compliance as architecture constraint
For any application involving protected health information (PHI: any individually identifiable health data, including names, dates, and medical record numbers), the question “which model should we use?” is often answered by “which models can we legally send this data to?” This rules out most API-based foundation models for production workloads on clinical data, pushing teams toward self-hosted open-source models or on-premises deployments. HIPAA (the Health Insurance Portability and Accountability Act, the U.S. federal law protecting patient health information) is not a feature request — it is an architectural constraint that narrows the model space before performance is even considered.
CLINICAL NLP IS ITS OWN DISCIPLINE
Clinical NLP is not just NLP applied to medicine. Clinical text — EHR notes, discharge summaries, radiology reports — has properties that fundamentally break assumptions baked into general-purpose language models, and ignoring them produces models that look competent on benchmarks but fail in practice.
Negation and assertion
“No evidence of malignancy” contains the word “malignancy” but asserts its absence. Clinical text is saturated with negation, hedging, and conditional statements: “denies chest pain,” “unlikely to represent metastatic disease,” “cannot rule out PE.” A model that treats token presence as semantic presence will extract the opposite of what the note says. This is not an edge case — studies have found that up to 50% of clinical concepts in discharge summaries are negated.
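The scope problem can be sketched with a NegEx-style heuristic. This is a minimal illustration: the trigger list and token window below are made up for the example, and production systems use much richer trigger sets plus syntactic context.

```python
# Common negation triggers seen in clinical notes (illustrative, not exhaustive).
NEGATION_TRIGGERS = ["no evidence of", "denies", "no ", "without",
                     "negative for", "unlikely to represent"]

def is_negated(sentence: str, concept: str, window: int = 6) -> bool:
    """Return True if `concept` appears within `window` tokens after a trigger."""
    text = sentence.lower()
    idx = text.find(concept.lower())
    if idx == -1:
        return False
    preceding = text[:idx]
    for trigger in NEGATION_TRIGGERS:
        t_idx = preceding.rfind(trigger)
        if t_idx != -1:
            # Count tokens between the end of the trigger and the concept.
            gap = preceding[t_idx + len(trigger):]
            if len(gap.split()) <= window:
                return True
    return False
```

A model (or extraction pipeline) without this layer will happily report "malignancy: present" for the first sentence below.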
Abbreviation ambiguity
“MS” means multiple sclerosis in neurology, mitral stenosis in cardiology, mental status in psychiatry, and morphine sulfate in pharmacy. “PT” can mean physical therapy, prothrombin time, or patient, depending on context. A study catalogued over 174,000 abbreviation expansions used in clinical text. General-purpose models trained on internet text have no reliable mechanism for disambiguating these — the correct expansion depends on the clinical context (department, note type, surrounding findings) in ways that require domain-specific training data.
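One common mitigation is a context-keyed sense inventory. The entries and the `expand` helper below are hypothetical, purely to illustrate why department context is the disambiguating signal; a learned disambiguator would also use note type and surrounding findings.

```python
# Illustrative sense inventory: (abbreviation, department) -> expansion.
# A real system would learn these mappings from domain-specific training data.
SENSE_INVENTORY = {
    ("MS", "neurology"): "multiple sclerosis",
    ("MS", "cardiology"): "mitral stenosis",
    ("MS", "psychiatry"): "mental status",
    ("MS", "pharmacy"): "morphine sulfate",
    ("PT", "rehabilitation"): "physical therapy",
    ("PT", "hematology"): "prothrombin time",
}

def expand(abbrev, department, default=None):
    """Resolve an abbreviation using the note's department as context."""
    return SENSE_INVENTORY.get((abbrev, department), default)
```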
Copy-forward artifacts
EHR systems encourage copy-forward: clinicians paste text from a previous note and modify it. This creates notes where the same paragraph appears across multiple encounters, sometimes with subtle edits and sometimes without any changes at all. A model trained on these notes cannot distinguish between a finding that is actively present and one that was pasted from three weeks ago. Copy-forward also inflates apparent data volume — you may have 10,000 notes but only 3,000 unique clinical observations.
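A rough way to measure that inflation is to hash normalized paragraphs across notes and count distinct ones. This is a sketch under simplifying assumptions (exact duplicates only); real pipelines also catch near-duplicates with fuzzy matching.

```python
import hashlib

def unique_observations(notes):
    """Count unique paragraphs across a set of notes, collapsing copy-forward
    duplicates by hashing case/whitespace-normalized paragraph text."""
    seen = set()
    for note in notes:
        for para in note.split("\n\n"):
            normalized = " ".join(para.lower().split())
            if normalized:
                seen.add(hashlib.sha256(normalized.encode()).hexdigest())
    return len(seen)
```

The gap between raw note count and this number is a useful data-audit statistic before any training run.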
Temporal reasoning
Clinical reasoning is inherently temporal. A potassium of 5.8 means something different if it is trending up from 4.5 versus down from 7.2. The relationship between a symptom onset, a medication change, and a lab result requires understanding time — and clinical notes encode time in inconsistent, often implicit ways. “Since starting lisinopril” requires the model to know when lisinopril was started, which may be documented in a different note entirely.
Tokenization challenges
General-purpose tokenizers (which split text into subword units for model input) were trained on internet text. Clinical abbreviations like “q4h” (every 4 hours), “PRN” (as needed), and “bid” (twice daily) may be split into meaningless subword tokens. Drug names, gene symbols, and ICD codes are similarly under-represented. This is not fatal — but it means that models using general tokenizers require more clinical training data to learn representations that a domain-specific tokenizer would capture natively. Some teams address this with continued pretraining on clinical corpora, which helps the model build better internal representations even with a suboptimal tokenizer.
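The effect can be illustrated with a toy longest-match subword tokenizer over two vocabularies. Both vocabularies are invented for the example; real tokenizers (BPE, WordPiece) are learned from corpus statistics, but the fragmentation behavior is the same.

```python
def greedy_tokenize(word, vocab):
    """Toy longest-match subword tokenizer (WordPiece-style, no '##' markers)."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character falls back to itself
            i += 1
    return tokens

GENERAL_VOCAB = {"q", "4", "h", "the", "and"}           # internet-text vocabulary
CLINICAL_VOCAB = GENERAL_VOCAB | {"q4h", "prn", "bid"}  # domain-extended vocabulary
```

Under the general vocabulary, "q4h" shatters into three tokens with no shared meaning; the domain vocabulary keeps it as a single unit the model can attach semantics to.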
MEDICAL VISION: WHAT'S SOLVED AND WHAT ISN'T
Computer vision in medicine is not one problem. It is dozens of problems at different stages of maturity, with different data constraints and different relationships between model architecture and annotation quality.
Radiology: relatively mature
Chest X-ray interpretation has the most FDA-cleared AI devices of any imaging modality. Models like CheXNet (2017) and its successors demonstrated human-level performance on specific findings (pneumothorax detection, cardiomegaly classification). The data infrastructure is mature: PACS (Picture Archiving and Communication System) infrastructure provides standardized DICOM images, and datasets like CheXpert (224,316 chest radiographs) and MIMIC-CXR (377,110 images with free-text reports) provide large-scale training data. The remaining challenges are in generalization across scanner types, patient populations, and the long tail of rare findings.
Pathology: emerging but complex
Digital pathology is earlier in its AI journey, largely because the data is harder. A single whole-slide image can be 40,000 × 40,000 pixels — gigapixel scale. You cannot feed this into a standard CNN. The dominant approach is multiple-instance learning: tile the slide into patches, encode each patch independently, then aggregate patch-level features for a slide-level prediction. Foundation models like CONCH and UNI (trained on hundreds of thousands of histopathology slides) are changing this by providing general-purpose patch encoders, but the aggregation strategy and the quality of pathologist annotations at the slide level remain the performance ceiling.
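At its simplest, the tile-then-aggregate pipeline reduces to pooling patch-level scores into one slide-level score. The sketch below uses top-k mean pooling; `patch_scores` stands in for the output of a patch encoder, and attention-weighted pooling is the more common production choice.

```python
def slide_prediction(patch_scores, top_k=3):
    """Multiple-instance learning aggregation: score a slide by pooling the
    tumor probabilities of its most suspicious patches (top-k mean pooling)."""
    top = sorted(patch_scores, reverse=True)[:top_k]
    return sum(top) / len(top)
```

The design choice matters clinically: max pooling is sensitive to a single false-positive patch, mean pooling over all patches dilutes small lesions, and top-k is a compromise between the two.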
The annotation ceiling
Across all imaging modalities, there is a consistent finding: model performance plateaus when label quality plateaus, not when model capacity is exhausted. A study on diabetic retinopathy grading found that upgrading annotations from general-ophthalmologist to retina-specialist quality improved model AUC by more than switching from ResNet-50 to a model with 4x the parameters. The same pattern appears in dermatology, where diagnostic accuracy correlates more strongly with annotator expertise than with model architecture.
Domain adaptation is non-trivial
Domain adaptation (adjusting a model trained on one data distribution to perform well on a different but related one) from natural images to medical images is not a simple transfer. ImageNet features (textures, edges, object shapes) transfer poorly to histopathology slides (cellular morphology, staining patterns, tissue architecture) or to CT scans (3D volumetric data, Hounsfield units, variable slice thickness). Models pretrained on natural images require substantial fine-tuning to be useful in clinical imaging — and that fine-tuning is bottlenecked by the quality of clinical annotations, not compute.
THE FINE-TUNING DECISION TREE
Choosing how to adapt a model to your clinical task is not a purely technical decision. It depends on data volume, task specificity, compliance constraints, and budget — and the right answer is different for every team.
Full fine-tuning
Update all model parameters on your clinical dataset. This gives maximum adaptation but requires the most compute, the most labeled data (typically tens of thousands of examples), and risks catastrophic forgetting — where the model loses general capabilities while overfitting to the fine-tuning distribution. Best suited for teams with large proprietary clinical datasets and dedicated ML infrastructure.
LoRA and QLoRA
LoRA (Low-Rank Adaptation) freezes the base model and trains small low-rank adapter matrices — typically less than 1% of the original parameters. QLoRA adds 4-bit quantization of the base model, reducing memory requirements further. This is the most common approach for clinical AI teams today: it works with hundreds to low thousands of examples, runs on a single GPU, and preserves the base model's general capabilities. The tradeoff is reduced expressiveness on tasks that require deep domain adaptation.
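The core of LoRA fits in a few lines: the frozen weight is augmented by a scaled low-rank product, and only the two small factors are trained. Below is a dependency-free sketch with a naive matrix multiply and tiny shapes, purely to show the arithmetic, not an implementation you would train with.

```python
def matmul(A, B):
    """Naive matrix multiply, for illustration only."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_effective_weight(W, A, B, alpha, r):
    """LoRA replaces a frozen weight W with W + (alpha / r) * (B @ A),
    where A is (r x in_dim) and B is (out_dim x r); only A and B are trained.
    With r << in_dim, out_dim, the trainable parameter count collapses."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

Because `W` is never updated, the base model's general capabilities survive, and adapters for different clinical tasks can be swapped over the same frozen backbone.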
Continued pretraining
Continued pretraining extends the base model's training on domain-specific corpora (PubMed abstracts, clinical notes, biomedical textbooks) before task-specific fine-tuning. This reshapes the model's internal representations to better capture clinical language. PMC-LLaMA and Meditron both used this approach. It is more expensive than LoRA but produces models with fundamentally better clinical language understanding. Worth the investment when your downstream tasks span multiple clinical applications rather than a single narrow task.
Prompt tuning and in-context learning
Soft prompt tuning trains a small set of continuous vectors prepended to the input, leaving the entire model frozen. This requires the least labeled data and compute, but also provides the least adaptation. In-context learning (providing examples in the prompt) requires no training at all but is limited by context window size and produces inconsistent results. Both approaches are useful for prototyping and proof-of-concept work, less so for production systems.
The decision framework
Ask three questions. How much labeled clinical data do you have? Under 500 examples: prompt tuning or in-context learning. 500–5,000: LoRA. 5,000+: full fine-tuning. Does patient data touch the model? If yes, self-hosted only — this eliminates API-based models and most cloud fine-tuning services. How many downstream tasks? Single task: LoRA directly. Multiple tasks across clinical domains: continued pretraining first, then task-specific LoRA adapters on top.
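The three questions can be encoded directly. Thresholds and recommendations are taken from the text; the function name and return format are invented for the sketch.

```python
def adaptation_strategy(n_labeled, phi_touches_model, n_tasks):
    """Map (data volume, PHI exposure, task count) to an adaptation approach,
    following the three-question decision framework."""
    # Q2: if patient data touches the model, API-based options are ruled out.
    hosting = "self-hosted only" if phi_touches_model else "API or self-hosted"
    # Q3: multiple downstream tasks favor continued pretraining first.
    if n_tasks > 1:
        method = "continued pretraining, then task-specific LoRA adapters"
    # Q1: otherwise, labeled-data volume picks the method.
    elif n_labeled < 500:
        method = "prompt tuning or in-context learning"
    elif n_labeled <= 5000:
        method = "LoRA"
    else:
        method = "full fine-tuning"
    return hosting, method
```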
WHY DATA QUALITY BEATS MODEL SCALE
The dominant narrative in general ML is that performance scales with model size and data volume. In clinical AI, this relationship breaks down — and understanding why is critical to making good investment decisions about where to spend your budget.
Scaling laws do not hold with noisy labels
Neural scaling laws predict smooth performance improvement as model size and dataset size increase. But these laws assume clean labels. In clinical datasets, label noise is pervasive: crowdsourced annotations from non-specialists, silver labels from NLP pipelines, ICD codes used as proxies for diagnoses. When label noise exceeds a threshold, adding more data with the same noise level does not improve performance — it teaches the model to reproduce the noise with higher confidence. The model gets better at being wrong.
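A quick simulation shows the ceiling: under symmetric label noise, even a model that predicts the true label perfectly appears to plateau at one minus the flip rate, no matter how much data you add. Binary labels and symmetric flips are simplifying assumptions for the sketch.

```python
import random

def noisy_label_agreement(n, flip_rate, seed=0):
    """Measure a PERFECT classifier's apparent accuracy against labels that
    were flipped with probability `flip_rate`. The result hovers at
    (1 - flip_rate) regardless of n -- more data cannot raise the ceiling."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(n):
        true = rng.randint(0, 1)
        observed = 1 - true if rng.random() < flip_rate else true
        agree += (true == observed)
    return agree / n
```

The converse is worse: a model that drives noisy-label accuracy above this ceiling is, by construction, learning the noise.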
The quality multiplier
Multiple studies have demonstrated that smaller datasets with expert annotations outperform larger datasets with noisy labels. In one clinical text classification study, 2,000 physician-annotated examples matched the performance of 15,000 crowdsourced examples — a 7.5x data efficiency gain from annotation quality alone. In medical imaging, upgrading annotator expertise from general practitioner to subspecialist consistently yielded larger accuracy improvements than doubling the training set.
Reasoning traces as training signal
The most significant quality dimension is not just label correctness but label richness. Reasoning traces — structured records of the clinical logic behind a label: the evidence considered, alternatives weighed, and final inference — transform training data from “what to predict” into “how to reason.” Models trained on reasoning-annotated data show improved performance even on tasks that were not explicitly part of the training set, because the reasoning transfers. Our clinical data annotation guide covers this in depth.
The compounding cost of bad data
Bad labels do not just reduce model accuracy. They compound through the pipeline: noisy labels produce a noisy model, which generates noisy synthetic data or pseudo-labels, which are used to train the next model iteration. Each cycle amplifies the original error. In clinical AI, where models may eventually inform treatment decisions, this compounding effect is not just a performance problem — it is a safety problem. The most efficient investment for most clinical AI teams is not a bigger model or more compute. It is better training data.
TRANSFER LEARNING AND DOMAIN SHIFT
A model trained at one institution, on one patient population, using one EHR system does not automatically work at another. This is the domain shift problem (the data a model encounters in deployment differs meaningfully from what it was trained on), and it is arguably the most underestimated challenge in clinical AI deployment.
Institutional variation
Hospitals differ in documentation practices, coding conventions, formulary choices, and clinical workflows. A sepsis prediction model trained on Epic EHR data at an academic medical center may rely on features (specific nursing assessments, particular lab ordering patterns) that do not exist or are structured differently in a Cerner-based community hospital. These are not edge cases — they are the norm. Multi-site clinical AI studies consistently show 5–15% performance degradation when a model is deployed outside its training institution without adaptation.
Population bias
Clinical datasets reflect the demographics of the institution that collected them. A dermatology model trained primarily on lighter skin tones performs significantly worse on darker skin tones — a well-documented and ethically consequential failure mode. Similarly, models trained on adult populations may not generalize to pediatric patients, and models trained in the U.S. healthcare system may embed assumptions about treatment protocols that do not hold internationally. The fix is not algorithmic — it is data diversity. You need training data that represents the population you intend to serve.
What transfers and what does not
General medical knowledge (anatomy, physiology, pharmacology) transfers well across contexts. Clinical reasoning patterns transfer moderately — a model that learned to reason through differential diagnosis at one institution can often apply that reasoning framework elsewhere. What does not transfer: institution-specific documentation patterns, local treatment protocols, population-specific disease prevalence, and any feature derived from a specific EHR system's data model.
Practical mitigation
The most effective mitigation is multi-site training data: models trained on data from multiple institutions generalize better than single-site models, even when total data volume is the same. Federated learning enables this without pooling raw data: each institution trains locally, and only model updates are shared. For teams that cannot access multi-site data, the minimum viable approach is prospective validation at the deployment site before go-live: measure performance on local data, identify where the model fails, and fine-tune on site-specific examples to close the gap.
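The standard aggregation step, FedAvg, is just a dataset-size-weighted average of per-site model weights. A minimal sketch with flat weight vectors follows; real systems average per-layer tensors, run many communication rounds, and often add secure aggregation on top.

```python
def fedavg(site_weights, site_sizes):
    """Federated averaging: combine per-site model weights into a global
    model, weighting each site's contribution by its local dataset size.
    Raw patient data never leaves the sites; only these vectors are shared."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]
```

The size weighting matters: a 20-bed community hospital and a large academic center should not pull the global model equally hard.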
THE REGULATORY LANDSCAPE FOR CLINICAL AI MODELS
Regulatory requirements are not an afterthought — they are engineering constraints that should shape model architecture, training pipeline, and data strategy decisions from the start. Retrofitting compliance is orders of magnitude more expensive than designing for it.
FDA Software as a Medical Device
In the U.S., clinical AI is regulated as Software as a Medical Device (SaMD): software intended for medical purposes without being part of a hardware device, subject to FDA classification tiers (Class I, II, III) based on patient risk. Classification depends on the seriousness of the condition the software addresses and its role in clinical decisions. A model that flags potential pneumothorax for radiologist review (decision support) faces lighter requirements than one that autonomously triages critical findings (decision making). As of 2025, the FDA has authorized over 950 AI/ML-enabled medical devices — the vast majority for radiology.
Locked vs. adaptive algorithms
A “locked” algorithm produces the same output for the same input every time — it does not learn or update after deployment. An “adaptive” algorithm continues to learn from new data. The FDA's predetermined change control plan (PCCP) framework, finalized in 2024, provides a pathway for adaptive algorithms, but it requires predefining the types of changes the algorithm can make, the retraining protocol, and the performance monitoring plan. This has direct implications for model architecture: if you want an adaptive clinical AI, you need to design the update mechanism and validation pipeline from day one.
EU AI Act
The EU AI Act classifies most clinical AI as “high risk,” requiring conformity assessments, technical documentation, data governance, human oversight, and post-market monitoring. The requirements on training data quality are particularly relevant: the Act mandates that training datasets be “relevant, sufficiently representative, and to the best extent possible, free of errors and complete.” This is not a vague aspiration — it is a legal requirement with enforcement mechanisms.
What this means for engineering
Three practical implications. First, data provenance: you need a complete audit trail for every training example — where it came from, how it was labeled, who labeled it, what quality checks were applied. Second, reproducibility: you need to be able to reproduce any version of your model from its training data and code. Third, monitoring: you need ongoing measurement of model performance in production, with defined thresholds for when performance degradation triggers revalidation. These requirements are easier to meet when they are designed into the system from the start rather than bolted on before a regulatory submission.
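A per-example audit record might look like the sketch below. The field names are illustrative, not a standard: each example carries its source, annotator, and QC history, and a content hash ties a model version to its exact training data.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainingExample:
    """One provenance record per training example (field names illustrative)."""
    text: str
    label: str
    source: str        # where the example came from, e.g. a site/corpus ID
    annotator_id: str  # pseudonymous ID of the labeling clinician
    qc_checks: tuple   # names of quality checks the label passed

    def fingerprint(self) -> str:
        """Stable content hash; a model version can record the hashes of
        every example it was trained on, making retraining reproducible."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()
```

Storing fingerprints alongside model artifacts gives you the audit trail and reproducibility requirements in one mechanism.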
WHERE THE FIELD IS HEADING
Predicting the future of clinical AI is risky — but certain trends have enough momentum and institutional investment behind them that they are worth preparing for.
Agentic clinical AI
The shift is from models that classify to agentic systems that take autonomous multi-step actions: ordering diagnostic tests, drafting treatment plans, scheduling follow-ups. This is a move from advisory to interventional AI. Google's AMIE (Articulate Medical Intelligence Explorer) demonstrated a conversational diagnostic agent that outperformed primary care physicians in diagnostic accuracy in controlled settings. The data requirements for agentic AI are qualitatively different: you need training data that captures clinical workflows, not just clinical knowledge. This means annotating sequences of decisions, not isolated classifications.
Retrieval-augmented generation
RAG (retrieval-augmented generation) grounds model outputs in retrieved evidence: clinical guidelines, drug databases, institutional protocols. This directly addresses the hallucination problem by anchoring responses to verifiable sources. The challenge is building and maintaining the knowledge base: clinical guidelines change, drug formularies update, and institutional protocols vary. RAG shifts the data quality problem from training data to retrieval corpus curation — different work, but equally dependent on expert involvement.
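The retrieve-then-generate loop can be sketched with a toy keyword-overlap retriever. Real systems use dense embeddings and rerankers, and the prompt format here is invented for illustration.

```python
def retrieve(query, corpus, k=1):
    """Toy retriever: rank documents by keyword overlap with the query."""
    q_terms = set(query.lower().split())
    return sorted(
        corpus,
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query, corpus):
    """Ground the model's answer in retrieved guideline text rather than
    parametric memory -- the generation step then cites checkable sources."""
    evidence = retrieve(query, corpus)
    return ("Answer using ONLY the evidence below.\n\nEvidence:\n"
            + "\n".join(evidence) + "\n\nQuestion: " + query)
```

The data-quality work moves into the corpus: if the retrieved guideline is stale, the grounded answer is confidently stale.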
Federated learning at scale
Federated learning enables training across institutions without pooling data. Early results are promising: federated models trained across 20+ hospitals have matched or exceeded models trained on any single site's data, while maintaining strict data separation. The infrastructure challenges (network bandwidth, heterogeneous compute, institutional coordination) are being addressed by platforms like NVIDIA FLARE and the federated learning framework from the MELLODDY consortium. As these platforms mature, multi-site clinical AI training will become the default rather than the exception.
Synthetic clinical data
Generating synthetic patient records to augment limited real data is an active area of research. Current approaches use fine-tuned LLMs to generate synthetic clinical notes or GANs to generate synthetic medical images. The promise is addressing data scarcity (rare diseases, small populations) and privacy (no real patients in the training set). The risk is subtle: synthetic data inherits and can amplify biases from the generation model, and models trained on synthetic data may learn artifacts of the generation process rather than real clinical patterns. Synthetic data is a complement to real expert-annotated data, not a replacement for it.
Multimodal foundation models
The next generation of clinical AI will process imaging, text, labs, and genomics jointly — mirroring how clinicians actually synthesize information. Early models like Med-Gemini and BiomedGPT demonstrate the feasibility. The data challenge is alignment: linking a chest X-ray to the radiology report to the lab results to the discharge summary for the same patient encounter, with consistent annotations across modalities. This is the hardest annotation problem in clinical AI, and it is the one that will define the next wave of model capabilities.
WHERE FABRICA FITS
Every section of this guide returns to the same point: the binding constraint on clinical AI is not model architecture — it is the quality, structure, and domain specificity of training data. Fabrica exists to solve that constraint.
Our physician network provides the expert annotation that clinical AI models need: structured reasoning traces for training data, gold-standard evaluation sets for benchmarking, and preference data for alignment. Whether you are fine-tuning a foundation model, building domain-specific evaluation benchmarks, or running RLHF (reinforcement learning from human feedback) pipelines, the quality of the human signal is what determines the quality of the model.
See our companion guides for deeper treatment of each stage of the clinical AI data pipeline: clinical data annotation, building gold-standard evaluation sets, and clinical model alignment.
REQUEST EARLY ACCESS