GUIDE

CLINICAL MODEL ALIGNMENT

How physician preference data drives RLHF and DPO pipelines for clinical LLMs — and why alignment is an ongoing requirement, not a one-time task.

01

BEYOND SUPERVISED FINE-TUNING

Supervised fine-tuning teaches a model what to say. Alignment teaches it what not to say — and how to navigate the space between helpful and harmful. For clinical LLMs, this distinction is the difference between a useful tool and a liability.

A fine-tuned clinical model may produce fluent, medically knowledgeable responses that are nonetheless unsafe: recommending a contraindicated drug combination, providing a diagnosis with false confidence, or failing to recommend urgent follow-up when a presentation warrants it. These failures aren't knowledge gaps — the model has the relevant information — they are alignment failures.

Alignment methods like RLHF (reinforcement learning from human feedback) and DPO (direct preference optimization) address this by training models to prefer outputs that physicians judge as safe, accurate, and clinically appropriate. The critical input to these pipelines is physician preference data.

02

HOW PREFERENCE ANNOTATION WORKS

Physicians are presented with pairs of model outputs for the same clinical prompt and asked to choose the better response — or rank them on specific dimensions. These pairwise preferences train a reward model that serves as a proxy for expert clinical judgment during reinforcement learning.

RLHF pipeline

The standard RLHF pipeline has three stages: supervised fine-tuning on domain-specific clinical data, training a reward model on physician preference pairs, and policy optimization (typically PPO) to maximize the learned reward. The reward model is the bottleneck — its quality is bounded by the quality and volume of the physician preference data it is trained on.
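The reward-model stage can be sketched as a pairwise Bradley-Terry objective: the physician-preferred response should score higher than the rejected one. A minimal sketch in plain Python (the function name is illustrative; a production reward model scores full token sequences with a learned network):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
    Training drives the reward model to score the physician-preferred
    response above the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair yields a small loss; a reversed pair, a large one.
low = bradley_terry_loss(2.0, -1.0)
high = bradley_terry_loss(-1.0, 2.0)
```

When both responses score equally, the loss sits at log 2; every preference pair the reward model gets right pushes the loss toward zero.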

Direct preference optimization

DPO simplifies the pipeline by eliminating the separate reward model, directly optimizing the language model on preference pairs. This reduces computational cost and training instability but places even more weight on preference data quality — there is no reward model to smooth over noisy or inconsistent annotations.
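The DPO objective on a single preference pair can be sketched as follows, assuming per-response log-probabilities from the policy being trained and from a frozen reference model (parameter names and the beta value are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO pushes up the policy's log-probability margin for the chosen
    response over the rejected one, measured relative to the frozen
    reference model; beta controls how far the policy may drift from
    the reference."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization the policy equals the reference, the margin is zero, and the loss is log 2; training drives it down by widening the margin on physician-labeled pairs, which is why a noisy or inconsistent label shifts the policy directly.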

Evaluation dimensions

Clinical preference annotation is most informative when it goes beyond a single “which is better” judgment. Effective schemes ask annotators to rate on multiple dimensions: clinical accuracy, safety, completeness, reasoning quality, and communication appropriateness. This multi-dimensional signal allows more targeted alignment — improving safety without sacrificing helpfulness, for example.
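One way to use the multi-dimensional signal is to keep per-dimension ratings in the dataset and collapse them to a scalar only at reward-training time, with weights set by the team. A sketch, with hypothetical weights and a 1-to-5 rating scale:

```python
# Hypothetical weights; the actual balance is a clinical and product decision.
WEIGHTS = {"accuracy": 0.30, "safety": 0.40, "completeness": 0.10,
           "reasoning": 0.10, "communication": 0.10}

def scalar_reward(ratings: dict) -> float:
    """Collapse per-dimension ratings (1-5 each) into one reward while the
    raw dimensions stay available for audits and targeted re-weighting."""
    return sum(WEIGHTS[d] * ratings[d] for d in WEIGHTS)
```

Keeping the raw dimensions means safety weight can be raised later without re-annotating, which is the practical payoff of the multi-dimensional scheme.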

03

THE SAFETY-HELPFULNESS TRADEOFF

Clinical models face a specific alignment tension that general-purpose models don't. They must avoid providing unsafe medical advice — refuse when uncertain, flag dangerous recommendations, defer to human judgment on edge cases — while remaining useful enough that clinicians actually adopt them.

Over-refusal

Models aligned too aggressively toward safety develop a pattern of over-refusal: declining to answer benign clinical questions, appending excessive disclaimers to straightforward responses, or deferring to “consult your physician” on questions that a clinical decision support tool should be able to address. This makes the tool useless in practice, regardless of how safe it is in theory.

Finding the boundary

Recent work combining Kahneman-Tversky Optimization with DPO has shown a 42% improvement in safety-related metrics by targeting this tradeoff directly. The key insight is that the boundary between caution and usefulness isn't a technical parameter to tune — it's a clinical judgment that requires physician input to define.

Preference data that maps this boundary — cases where the physician says “this refusal is appropriate” vs. “this refusal is unhelpful” — is among the most valuable alignment data you can collect. It requires annotators who understand clinical risk in practice, not just in principle.

04

DESIGNING PREFERENCE TASKS FOR CLINICIANS

Physician time is expensive. Preference annotation tasks need to be designed for efficiency without sacrificing signal quality.

Prompt selection

Not all prompts are equally informative for alignment. Prioritize prompts where the model currently fails — edge cases, safety boundaries, ambiguous clinical scenarios. Prompts where the model already produces consistently good outputs add little alignment signal. Active selection of informative prompts reduces the volume of annotation needed.
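Active selection can be approximated before any annotator sees a prompt by sampling several completions per prompt and prioritizing the prompts where a current reward model disagrees with itself the most. A sketch of one such heuristic (the variance criterion is an assumption, one option among several):

```python
from statistics import pvariance

def prioritize_prompts(scores_by_prompt: dict, k: int) -> list:
    """Rank prompts by the variance of reward-model scores across sampled
    completions: high variance suggests the model's behavior on that
    prompt is unsettled, so physician preferences there carry the most
    alignment signal."""
    ranked = sorted(scores_by_prompt,
                    key=lambda p: pvariance(scores_by_prompt[p]),
                    reverse=True)
    return ranked[:k]
```

Prompts the model already answers consistently well drop to the bottom of the queue, which is exactly the "little alignment signal" case described above.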

Response pair generation

The model outputs presented for comparison should exhibit meaningful differences along the dimensions you care about. Two near-identical responses waste annotator time. Pairs should be sampled to span the quality spectrum: one clearly better, one clearly worse, with the interesting cases in between where physician judgment resolves genuine ambiguity.
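Given reward-model scores for a set of candidate responses, pair selection can be sketched as choosing the highest-ranked pair whose score gap falls in a target band: wide enough for a meaningful comparison, narrow enough that physician judgment resolves real ambiguity. The thresholds here are illustrative:

```python
def select_pair(scored: list, min_gap: float = 0.5, max_gap: float = 2.0):
    """From (response, score) candidates, return one (better, worse) pair
    whose score gap lies in the target band, or None if no pair qualifies.
    Near-identical pairs (gap below min_gap) waste annotator time."""
    ranked = sorted(scored, key=lambda rs: rs[1], reverse=True)
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            gap = ranked[i][1] - ranked[j][1]
            if min_gap <= gap <= max_gap:
                return ranked[i][0], ranked[j][0]
    return None
```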

Annotation interface

Clinical annotators should see both responses side by side with the full clinical context visible. Asking them to rate on 3–5 specific dimensions (accuracy, safety, reasoning, completeness, tone) takes marginally longer than a single preference but produces substantially richer signal. A skilled physician can complete a well-designed preference task in 2–4 minutes. Poorly designed tasks take 10+ minutes and produce lower-quality data.
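The resulting record might look like the following sketch; the field names and 1-to-5 rating scale are assumptions about one reasonable schema, not a standard:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PreferenceRecord:
    """One completed preference task: a pairwise choice plus
    per-dimension 1-5 ratings for each response."""
    prompt_id: str
    chosen_id: str                  # physician-preferred response
    rejected_id: str
    chosen_ratings: dict = field(default_factory=dict)
    rejected_ratings: dict = field(default_factory=dict)
    annotator_id: str = ""
    refusal_appropriate: Optional[bool] = None  # boundary-mapping cases
```

The optional refusal field captures the "appropriate vs. unhelpful refusal" judgment discussed earlier, so that boundary-mapping data lives in the same pipeline as ordinary preference pairs.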

05

ITERATIVE POST-DEPLOYMENT ALIGNMENT

Pre-deployment alignment — training on preference data before release — is necessary but not sufficient. Models encounter adversarial and edge-case inputs in deployment that are impossible to anticipate in a training set.

The deployment distribution shift

Real users interact with clinical AI in ways that pre-deployment testing doesn't capture: unusual phrasings, adversarial probes, novel clinical scenarios, or combinations of conditions that weren't represented in training data. This distribution shift means iterative alignment pipelines must route a sample of live model outputs back to physician reviewers, generating fresh preference data that captures these deployment-specific failure modes.

Feedback loops

The most effective approach is a continuous feedback loop: deploy, sample outputs, collect physician preferences on the sampled outputs, update the reward model, fine-tune, redeploy. This cycle doesn't need to run daily, but it should run regularly — clinical guidelines update, model behavior drifts with use, and new failure modes emerge.
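The sampling step of that loop can be sketched as follows, assuming outputs carry a `flagged` field set upstream (for example by a safety classifier or a user report); the sampling rates are illustrative:

```python
import random

def sample_for_review(live_outputs, base_rate=0.02, flagged_rate=0.5, seed=None):
    """Route a fraction of production outputs to physician reviewers,
    oversampling outputs flagged upstream so the newest failure modes
    dominate the fresh preference data."""
    rng = random.Random(seed)
    batch = []
    for out in live_outputs:
        p = flagged_rate if out.get("flagged") else base_rate
        if rng.random() < p:
            batch.append(out)
    return batch
```

The reviewed batch feeds the same preference-annotation pipeline used pre-deployment, so each cycle of the loop reuses existing tooling rather than requiring a separate process.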

This means preference annotation is not a one-time project with a completion date. It is an ongoing data requirement for any clinical AI system that remains in production. The teams that build this into their operating model from the start avoid the more expensive alternative: discovering alignment failures through adverse events.

06

COMMON PITFALLS

Using non-clinical annotators

General-purpose RLHF annotators can judge fluency and helpfulness. They cannot judge clinical accuracy, safety implications, or whether a recommendation is appropriate for a specific patient context. Clinical alignment requires clinical annotators. There is no shortcut.

Optimizing for a single dimension

Aligning exclusively on safety produces over-refusal. Aligning exclusively on helpfulness produces unsafe outputs. The preference data must capture the multidimensional nature of clinical quality — and the reward model must be trained to balance these dimensions.

Insufficient volume

Reward models trained on small preference datasets overfit to annotator idiosyncrasies rather than learning generalizable clinical judgment. The minimum viable volume depends on domain complexity, but teams consistently underestimate how much preference data is needed for stable alignment.

Treating alignment as a one-time cost

The most expensive pitfall. Models trained and deployed without a plan for ongoing alignment accumulate drift and failure modes silently. By the time the problem is visible — an adverse event, a user complaint, a regulatory review — the remediation cost far exceeds what continuous alignment would have cost.

07

WHERE FABRICA FITS

Fabrica provides the physician preference data pipeline that clinical AI teams need for alignment. Our network of clinical annotators performs multi-dimensional preference annotation on your model outputs — rating accuracy, safety, reasoning, and completeness — with the clinical expertise that general annotation workforces cannot provide.

Alignment is one piece of the clinical AI data pipeline. See our companion guides on clinical data annotation and building gold-standard evaluation sets.

REQUEST EARLY ACCESS