CASE STUDY

WHEN STRUCTURED SCHEMAS TRANSFER EXPERTISE: AI ANNOTATION IN ENDOMETRIOSIS IMAGING

How a classification system designed for surgery is proving that expert knowledge in pelvic MRI can be codified — and why AI segmentation still can't use it at scale.

JARAMILLO-CARDOSO ET AL. · HARTH ET AL. · ESUR CONSENSUS · 2019–2025

THE FIELD

Endometriosis affects an estimated 10% of reproductive-age women worldwide — approximately 190 million people (WHO, 2025). Endometrium-like tissue grows outside the uterus, causing chronic pain, heavy bleeding, and infertility. Deep infiltrating endometriosis — where lesions penetrate more than 5mm beneath the peritoneal surface — involves structures across multiple anatomical compartments: uterosacral ligaments, bowel, bladder, rectovaginal septum, and parametria.

The average time from symptom onset to diagnosis is between 4 and 12 years depending on geography. A 2024 global scoping review of 22 studies found a worldwide average delay of 6.6 years. In the United Kingdom, a 2024 survey of 4,371 women found the average had risen to 8 years and 10 months — a 10-month increase since 2020. Seventy-eight percent reported that doctors dismissed their symptoms. Seventy percent visited their GP five or more times before receiving a diagnosis.

The economic burden is substantial. In the United States, endometriosis patients incur mean annual direct healthcare costs of $16,573 — more than three times the $4,733 for matched controls (Soliman et al., 2018). Approximately two-thirds of patients undergo surgery within 12 months of diagnosis.

MRI is recommended by the European Society of Urogenital Radiology (ESUR, 2025) as the primary non-invasive imaging method when transvaginal ultrasound is inconclusive, before surgery, or when symptoms persist after treatment. Accurate MRI interpretation is critical for surgical planning: surgeons need to know which compartments are involved, the size and extent of lesions, and their proximity to critical structures like the ureters.

If AI could reliably detect and segment endometriotic lesions on MRI, it would shorten diagnostic delays, improve surgical planning, and extend specialist-level interpretation to settings where pelvic MRI expertise is scarce. But building that AI requires labeled training data. And labeling endometriosis on MRI is harder than it appears.

THE EXPERTISE GAP

Endometriosis MRI interpretation is not a general radiology skill. Deep infiltrating lesions are often subtle — appearing as small hypointense nodules on T2-weighted sequences, surrounded by complex pelvic anatomy where bowel, ligaments, and reproductive organs converge. The disease manifests across multiple anatomical compartments simultaneously, and each location presents different imaging characteristics and diagnostic challenges.

Non-specialist radiologists miss the majority of lesions

Jaramillo-Cardoso et al. (2019) compared three approaches to reading pelvic MRI for endometriosis in 59 surgically confirmed patients at a tertiary academic center. The results were stark.

Routine reads by non-specialist radiologists detected just 42.9% of surgically confirmed lesions, with 95.3% specificity. They were good at avoiding false positives but missed more than half of all disease. Structured reads, in which radiologists reported against a compartment-based template, jumped to 86.4% sensitivity but dropped to 45.9% specificity. The template forced radiologists to look in compartments they would otherwise skip, catching more disease but over-calling in the process. Structured expert reads, in which experienced specialists used the same structured approach, achieved 74.2% sensitivity with 81.8% specificity, the best balance of sensitivity and specificity.

The gap between 42.9% and 74.2% sensitivity has real consequences for patients: of every 100 surgically confirmed lesions, roughly 31 would be detected by a specialist but missed by a generalist.
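One way to make "best balance" concrete is Youden's J statistic (sensitivity + specificity - 1). The sketch below applies it to the three sensitivity and specificity pairs quoted above. Using J as the summary is an assumption of this illustration, not a metric reported in the study.

```python
# Sensitivity and specificity as quoted from Jaramillo-Cardoso et al. (2019).
READS = {
    "routine": (0.429, 0.953),
    "structured": (0.864, 0.459),
    "structured_expert": (0.742, 0.818),
}

def youden_j(sensitivity: float, specificity: float) -> float:
    """Youden's J: 0 is chance-level, 1 is a perfect test."""
    return sensitivity + specificity - 1.0

# J is our illustrative summary; the study itself does not report it.
scores = {name: round(youden_j(se, sp), 3) for name, (se, sp) in READS.items()}
best = max(scores, key=scores.get)

for name, j in scores.items():
    print(f"{name:18s} J = {j:.3f}")
print("best balance:", best)  # structured_expert
```

On this measure the structured template alone scores slightly below routine reads, because its gain in sensitivity is offset by the loss of specificity; only the structured expert reads improve on both.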

Even experienced readers disagree

A 2025 study at Mayo Clinic measured inter-reader agreement among seven abdominal radiologists experienced in endometriosis MRI across 751 patients. Fleiss' kappa was 0.5718, classified as moderate agreement. These were not trainees or generalists. They were experienced specialists, and their agreement was only about 57% of the way from chance to perfect.
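For readers who want to see what a Fleiss' kappa measures, here is a minimal sketch of the standard formula. The toy ratings are hypothetical and are not the Mayo data.

```python
from typing import Sequence

def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of raters who put subject i in category j.
    Every subject must be rated by the same number of raters."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / (n_subjects * n_raters)
             for j in range(n_categories)]
    # Per-subject agreement: fraction of rater pairs that agree.
    p_subj = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]

    p_observed = sum(p_subj) / n_subjects
    p_expected = sum(p * p for p in p_cat)   # agreement expected by chance
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when every rating is identical")
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical toy data: 7 readers, columns = (present, absent).
toy = [[7, 0], [0, 7], [4, 3], [5, 2], [1, 6]]
print(round(fleiss_kappa(toy), 3))
```

Kappa rescales observed agreement so that 0 means chance-level and 1 means perfect, which is why a value near 0.57 reads as "moderate" rather than "57% of cases matched".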

Detection accuracy also varies sharply by anatomical location. MRI achieves near-perfect sensitivity for bladder endometriosis (100%) and bowel involvement (100% sensitivity, 93.3% specificity), but drops to 80% for the rectovaginal septum and just 50% for ureteral involvement. Peritoneal endometriosis — superficial lesions scattered across peritoneal surfaces — is the hardest: inter-reader kappa of just 0.39 in a 2024 study of 412 women.

The pattern is consistent: endometriosis MRI is hard enough that expertise matters enormously, with specific anatomical locations posing particular challenges. This is not a domain where any trained radiologist can produce reliable annotations.

WHERE CURRENT AI STALLS

Segmentation accuracy is poor

In 2025, researchers published an AI-based MRI reading support program (AMP) for deep endometriosis diagnosis in Scientific Reports (Nature Portfolio). The system used nnU-Net models for lesion segmentation. The results: a mean Dice coefficient of 0.293 for endometriotic plaque segmentation and 0.580 for ovarian endometriotic cyst segmentation.

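A Dice of 0.293 means the model's predicted lesion mask barely overlaps the expert reference. Dice is twice the shared volume divided by the sum of the two volumes, so for masks of equal size less than 30% of each coincides with the other. For clinical use, where surgeons need to know the precise extent of disease for operative planning, this is not yet actionable.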
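The Dice coefficient is simple enough to sketch in a few lines. The example below treats a mask as a set of pixel coordinates (an assumption made for readability; real pipelines use arrays) and also converts a Dice score to the equivalent intersection-over-union.

```python
def dice(pred: set, ref: set) -> float:
    """Dice = 2|A & B| / (|A| + |B|) over sets of pixel or voxel coordinates."""
    if not pred and not ref:
        return 1.0  # both masks empty: treated as perfect agreement by convention
    return 2 * len(pred & ref) / (len(pred) + len(ref))

def dice_to_iou(d: float) -> float:
    """IoU (Jaccard index) implied by a Dice score: IoU = D / (2 - D)."""
    return d / (2.0 - d)

# Toy 2D example: two 10x10 squares offset diagonally by 5 pixels.
ref = {(r, c) for r in range(10) for c in range(10)}
pred = {(r, c) for r in range(5, 15) for c in range(5, 15)}

print(round(dice(pred, ref), 3))      # 0.25
print(round(dice_to_iou(0.293), 3))   # 0.172
```

In the toy case a quarter of each square overlaps the other, giving a Dice of 0.25, which is close to the reported plaque score. Expressed as intersection-over-union, a Dice of 0.293 corresponds to roughly 0.17.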

Detection works better, but datasets are still small

Classification (does this patient have endometriosis?) performs better than segmentation (where exactly are the lesions?). A 2025 Mayo Clinic study trained a 3D-DenseNet-121 on 751 patients (395 with pathologically confirmed endometriosis, 356 controls) using multi-sequence MRI. The model achieved an F1 score of 0.881 and AUROC of 0.911 using ensemble methods. That is promising for screening, but a patient-level yes/no does not give surgeons the lesion locations they need for planning.

The largest multi-rater segmentation dataset published to date contains just 51 subjects with annotations from three raters, plus 81 subjects with single-rater labels (Scientific Data, 2025). Modern medical image segmentation models typically require hundreds to thousands of labeled volumes for robust generalization across institutions and patient populations.

The annotation bottleneck is explicit

A 2024 multi-rater learning study explicitly acknowledged that even experienced clinicians struggle with accurate classification from MRI images. Their solution: a framework that extracts cleaner labels from multiple noisy annotations per training sample. But this approach bootstraps from existing expert labels — it cannot generate them. The quality ceiling is still set by the quality of the human annotations it starts from.

The AI reading support program confirmed the dynamic from a different angle: with AI assistance, radiologist recall for plaque detection improved from 0.73 to 0.91. AI can help experts find what they might miss — but training that AI in the first place requires labeled data that does not yet exist at scale.

THE SCHEMA THAT ALREADY WORKS

Here is where endometriosis imaging differs from most clinical AI annotation problems. A structured classification system already exists, has been validated, and has demonstrated that it transfers expert knowledge to non-specialists.

The #Enzian classification

In 2021, the #Enzian classification was published as a comprehensive system for describing and staging endometriosis. It divides the pelvis into compartments — A (rectovaginal/retrocervical), B (uterosacral ligaments/parametria, graded separately by side), C (rectum) — and grades lesion severity by size (1: <1cm, 2: 1–3cm, 3: >3cm). Additional categories cover ovarian endometriomas (O), adenomyosis (FA), bladder (FB), intestinal (FI), and ureteral (FU) involvement. The classification was originally designed for surgical documentation, but it has been increasingly applied to MRI. And the MRI application data is striking.
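The size-grading rule described above is mechanical enough to encode directly. The following is a minimal sketch covering only compartments A, B, and C as described in this section; the class name, field names, and label format are illustrative assumptions and do not reproduce official #Enzian notation.

```python
from dataclasses import dataclass
from typing import Optional

GRADED_COMPARTMENTS = ("A", "B", "C")  # size-graded compartments named above

def size_grade(size_cm: float) -> int:
    """Grade by lesion size: 1 (<1 cm), 2 (1-3 cm), 3 (>3 cm)."""
    if size_cm <= 0:
        raise ValueError("size_cm must be positive")
    if size_cm < 1.0:
        return 1
    return 2 if size_cm <= 3.0 else 3

@dataclass(frozen=True)
class Finding:  # hypothetical container, not part of any published tool
    compartment: str             # "A", "B" or "C"
    size_cm: float
    side: Optional[str] = None   # "left" / "right", required for B only

    def __post_init__(self):
        if self.compartment not in GRADED_COMPARTMENTS:
            raise ValueError(f"unknown compartment {self.compartment!r}")
        if self.compartment == "B" and self.side not in ("left", "right"):
            raise ValueError("compartment B is graded separately by side")

    @property
    def label(self) -> str:
        """Illustrative label such as 'C2' or 'B1 (left)'."""
        base = f"{self.compartment}{size_grade(self.size_cm)}"
        return f"{base} ({self.side})" if self.side else base

print(Finding("C", 2.4).label)          # C2
print(Finding("B", 0.6, "left").label)  # B1 (left)
```

Encoding the rule this way makes the boundary cases explicit: a lesion of exactly 1 cm or 3 cm falls in grade 2 under the ranges quoted above.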

One hour of training, near-expert agreement

Harth et al. (2023) prospectively evaluated inter- and intraobserver agreement for the #Enzian classification on MRI across 50 consecutive patients. Three radiologists participated: two with 5–7 years of pelvic MRI experience, and one musculoskeletal radiologist with no pelvic MRI experience who received a single one-hour training session on the classification system.

The results: inter-reader agreement for DIE diagnosis across all three readers reached a Fleiss' kappa of 0.89 (excellent). For endometriomas: 0.93 (excellent). Between the two experienced readers, agreement on compartments A, B, and C was excellent (weighted kappa 0.84–0.89). Between each experienced reader and the previously inexperienced reader: weighted kappa 0.64–0.91 (substantial to excellent).

The study's conclusion: “radiologists without specific experience in pelvic MRI can achieve substantial to excellent agreement with experienced radiologists in the application of the #Enzian classification on MRI after only a short training and with guidance from explanatory illustrations.”

A 2024 retrospective study of 412 women confirmed these findings at larger scale, with inter-reader agreement of Cohen's kappa 0.75–0.96 for most compartments. The exception: peritoneal involvement (kappa 0.39) — the one category where the disease is genuinely too subtle and diffuse for current classification criteria to resolve.

Structured reporting doubles sensitivity

The Jaramillo-Cardoso data tell the same story from the detection side. Structured reporting, using a compartment-based checklist rather than free-text prose, improved sensitivity from 42.9% to 86.4%, although specificity fell from 95.3% to 45.9%. The schema did not make radiologists smarter. It forced them to systematically evaluate every compartment, catching disease they would otherwise have overlooked.

The 2025 ESUR consensus guidelines — developed through a Delphi process involving 20 expert radiologists with at least five years of endometriosis imaging experience — now formally recommend structured compartment-based MRI reporting for endometriosis. Ninety-five percent of the expert panel agreed that using MRI classification is clinically useful. This is not theoretical. The evidence that structured schemas transfer diagnostic capability in endometriosis imaging is published, replicated, and codified in international guidelines.

THE GAP THAT REMAINS

If structured schemas work for clinical reading, why can't AI use them? Because clinical classification and pixel-level annotation are fundamentally different tasks.

Classification is not segmentation

The #Enzian classification tells a radiologist: “there is a grade 2 lesion in compartment C.” It does not tell an annotator: “draw the boundary of that lesion on each of the 40 axial MRI slices where it appears.” Clinical classification — the task that #Enzian and structured reporting improve — requires detecting the presence and approximate severity of disease. AI segmentation requires delineating exact boundaries, slice by slice, across heterogeneous lesion morphologies.

The 0.293 Dice coefficient for plaque segmentation shows where this distinction matters. Radiologists can agree that a lesion exists (kappa 0.89). They have far more trouble agreeing on precisely where it ends and normal tissue begins — especially when lesions are small, irregular, or involve tissue planes that change across imaging slices.

Peritoneal disease exposes the limit

Peritoneal endometriosis — superficial lesions on peritoneal surfaces — has inter-reader kappa of just 0.39, the lowest of any compartment. It is subtle, diffuse, and lacks the nodular morphology that makes deep disease visible. The #Enzian classification captures it (category P), but MRI-based assessment is unreliable enough that the 2023 Harth study omitted it from MRI evaluation entirely.

For AI, peritoneal disease is the frontier: the compartment most likely to be missed by untrained annotators, most likely to produce disagreement among experts, and most in need of the kind of iterative schema refinement that turns poor agreement into usable labels.

Datasets are fragmented across institutions

The ESUR consensus involved 20 expert radiologists across institutions. The Harth study used readers from two centers. The Mayo dataset was single-center. The multi-rater segmentation dataset spans two institutions. No shared annotation infrastructure connects these efforts. Each team builds its own labeling pipeline, defines its own boundary criteria, and produces labels that may not be compatible with other datasets. The expertise exists. The annotation infrastructure to aggregate it does not.

HOW FABRICA CHANGES THIS

Endometriosis imaging presents the annotation problem Fabrica was built for: a domain where structured expert knowledge demonstrably transfers to non-specialists, but no infrastructure exists to translate that transfer into labeled training data at scale.

From clinical classification to annotation schema

The #Enzian classification proves that compartment-level detection can be taught in an hour. Fabrica extends this by encoding expert-validated boundary criteria into annotation schemas: not just “is there a lesion in compartment C?” but “trace the lesion boundary on each slice where signal abnormality is visible, using these T2 hypointensity criteria, and flag slices where the boundary is ambiguous.” The expert designs the schema. The schema guides the annotator. The annotator produces pixel-level labels at a quality level that would otherwise require the expert's direct involvement on every case.

Calibration batches for the hard compartments

The 0.39 kappa for peritoneal disease is not a reason to abandon annotation — it is a signal that the schema needs iteration. Fabrica's calibration batches — small-scale annotation rounds where agreement is measured before scaling — are designed precisely for this. Run 20 cases through three annotators using the initial peritoneal criteria. Measure where they disagree. Refine the boundary definition. Re-run. This iterative cycle converts poor agreement into documented, improving agreement — and catches the problem before hundreds of cases are labeled with inconsistent criteria.
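The measure, refine, re-run cycle can be sketched as a per-compartment agreement report. This is an illustration only, not Fabrica's implementation: the function names, the use of mean pairwise Cohen's kappa, and the 0.6 threshold are all assumptions.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b) -> float:
    """Unweighted Cohen's kappa for two annotators' labels on the same cases."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if p_exp == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_obs - p_exp) / (1.0 - p_exp)

def calibration_report(labels, threshold=0.6):
    """labels[compartment][annotator] -> list of per-case labels.
    Returns {compartment: (mean pairwise kappa, needs_refinement)}."""
    report = {}
    for comp, by_annotator in labels.items():
        pairs = combinations(sorted(by_annotator), 2)
        kappas = [cohen_kappa(by_annotator[x], by_annotator[y]) for x, y in pairs]
        mean_k = sum(kappas) / len(kappas)
        report[comp] = (round(mean_k, 3), mean_k < threshold)
    return report

# Hypothetical batch: 1 = lesion present, 0 = absent, four cases per annotator.
batch = {
    "C": {"ann1": [1, 0, 1, 0], "ann2": [1, 0, 1, 0], "ann3": [1, 0, 1, 0]},
    "P": {"ann1": [1, 0, 1, 0], "ann2": [1, 1, 0, 0], "ann3": [0, 1, 1, 0]},
}
print(calibration_report(batch))
# {'C': (1.0, False), 'P': (0.0, True)}
```

A compartment flagged as needing refinement would go back to the expert for a tighter boundary definition before the next batch is run.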

Continuous agreement monitoring

The Mayo study found kappa of 0.5718 among experienced radiologists — but only discovered this after collecting all annotations. Fabrica measures inter-annotator agreement continuously during annotation, surfacing which compartments, which lesion morphologies, and which boundary definitions drive disagreement as it happens. When three annotators disagree on where a uterosacral ligament lesion ends, the platform flags it for expert review rather than silently producing a noisy label.
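As a rough illustration of monitoring agreement while annotation is still in progress, the sketch below scores each incoming case by mean pairwise Dice across annotators and queues low-agreement cases for review. The class, its method names, and the 0.7 threshold are hypothetical and are not a description of the platform's internals.

```python
from collections import defaultdict
from itertools import combinations

def dice(a: set, b: set) -> float:
    """Dice overlap between two masks given as sets of voxel coordinates."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

class AgreementMonitor:  # hypothetical helper for illustration
    def __init__(self, flag_below: float = 0.7):
        self.flag_below = flag_below
        self._scores = defaultdict(list)   # compartment -> per-case scores
        self.review_queue = []             # cases awaiting expert review

    def add_case(self, case_id, compartment, masks) -> float:
        """masks: one set of (slice, row, col) voxels per annotator (2 or more)."""
        pairs = list(combinations(masks, 2))
        score = sum(dice(a, b) for a, b in pairs) / len(pairs)
        self._scores[compartment].append(score)
        if score < self.flag_below:
            self.review_queue.append((case_id, compartment, round(score, 3)))
        return score

    def running_mean(self, compartment) -> float:
        scores = self._scores.get(compartment)
        if not scores:
            raise KeyError(f"no cases recorded for {compartment!r}")
        return sum(scores) / len(scores)

monitor = AgreementMonitor()
lesion = {(12, 40, 41), (12, 40, 42)}
monitor.add_case("case-001", "B", [lesion, lesion, lesion])
monitor.add_case("case-002", "B", [{(9, 1, 1), (9, 1, 2)},
                                   {(9, 1, 2), (9, 1, 3)},
                                   {(9, 1, 3), (9, 1, 4)}])
print(monitor.review_queue)   # [('case-002', 'B', 0.333)]
```

Because the score is computed per case as it arrives, a drifting boundary definition shows up in the running mean for that compartment long before the full dataset is labeled.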

Cross-institutional annotator networks

The ESUR consensus demonstrated that 20 experts across institutions can agree on classification criteria. Fabrica operationalizes this for annotation: connecting pelvic MRI specialists across centers into collaborative workflows where each annotator works independently, disagreements are preserved as signal rather than collapsed into majority votes, and the distributed labels reflect the real-world variation a deployed model will encounter.
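To show what "preserved as signal rather than collapsed into majority votes" can mean in practice, here is a minimal sketch that keeps the per-voxel fraction of annotators marking a lesion, alongside the majority-vote mask it would otherwise be reduced to. The function names and the set-based mask representation are assumptions for illustration.

```python
from collections import Counter

def soft_label(masks):
    """Per-voxel fraction of annotators who marked the voxel as lesion.
    Each mask is a set of voxel coordinates from one annotator."""
    n = len(masks)
    votes = Counter(voxel for mask in masks for voxel in mask)
    return {voxel: count / n for voxel, count in votes.items()}

def majority_vote(masks):
    """Hard mask: voxels marked by more than half of the annotators."""
    return {v for v, frac in soft_label(masks).items() if frac > 0.5}

# Hypothetical 1D example with three annotators.
masks = [{1, 2, 3}, {2, 3, 4}, {3}]
soft = soft_label(masks)
print(sorted(soft.items()))
print(sorted(majority_vote(masks)))   # [2, 3]
```

The majority mask discards voxels 1 and 4 entirely, while the soft label records that one annotator in three considered them part of the lesion, which is information a model can be trained on.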

The bottom line

Endometriosis imaging AI is stuck at a segmentation Dice of 0.293 for the lesions surgeons most need to see. Not because models are not powerful enough: nnU-Net is a mature, widely used segmentation framework. Not because the clinical need is not there: an estimated 190 million women are affected, with an average diagnostic delay approaching seven years. Because the expert knowledge needed to annotate pelvic MRI accurately has not been channeled into annotation infrastructure that produces training data at scale. The #Enzian classification proved the knowledge is transferable. Fabrica builds the pipeline to transfer it.

SOURCES

Jaramillo-Cardoso, A. et al. (2019). Pelvic MRI in the diagnosis and staging of pelvic endometriosis: added value of structured reporting and expertise. Abdom Radiol.

Harth, S. et al. (2023). Application of the #Enzian classification for endometriosis on MRI: prospective evaluation of inter- and intraobserver agreement. Front Med, 10, 1303593.

ESUR Consensus (2025). MRI for endometriosis: indications, reporting, and classifications; protocol, lexicon, and compartment-based analysis. Eur Radiol.

(2025). Development of an AI-based magnetic resonance imaging reading support program (AMP) for deep endometriosis diagnosis. Sci Rep (Nature).

(2025). A Multi-Modal Pelvic MRI Dataset for Deep Learning-Based Pelvic Organ Segmentation in Endometriosis. Sci Data (Nature).

Soliman, A.M. et al. (2018). Real-World Evaluation of Direct and Indirect Economic Burden Among Endometriosis Patients in the United States. Adv Ther, 35, 408–423.
