3745 - Large Language Models for Oncologic Note Summarization: A Blinded Pilot Study Informing a Non-Inferiority Trial
Presenter(s)

D. J. Wu1, S. Liu2, E. J. Hsu1, E. J. Smith1, J. C. Jagodinsky1, L. Xing3, and M. F. Gensheimer1; 1Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA, 2Department of Biomedical Data Science, Stanford University, Stanford, CA, 3Department of Radiation Oncology, Stanford University, Stanford, CA
Purpose/Objective(s): Documentation burden is particularly acute in oncologic specialties owing to the complex, multidisciplinary care of cancer patients. Large language models (LLMs) hold promise for reducing this burden, but rigorous evaluation of their accuracy, safety, and usefulness remains lacking. We conducted a blinded pilot study comparing a prompt-optimized LLM with resident physicians in summarizing Head & Neck oncologic consult notes prior to Radiation Oncology referral, to inform an adequately powered non-inferiority trial. Our primary hypothesis was that LLM summaries would be non-inferior to resident-authored summaries across multiple quality metrics.
Materials/Methods: We extracted 11,027 clinical notes from 57 de-identified Head & Neck (H&N) cancer patients using the Observational Medical Outcomes Partnership Common Data Model. After filtering for medical and surgical oncology consultations performed prior to radiation therapy, we selected one note from each of 12 patients. A consensus summarization framework was developed by radiation oncology attending and resident physicians, requiring inclusion of histology, stage, pending workup, referrals, treatment plan, and follow-up. Four residents each summarized 3 unique consult notes, while an LLM (OpenAI o1), prompt-optimized with the TextGrad “automatic differentiation” technique using four resident summaries as training data, generated summaries for the other 8 notes. Three residents blindly rated 6 matched LLM summaries and 6 peer summaries (excluding their own) on 5-point Likert scales. Two-way random-effects intraclass correlation coefficients (ICC(2)) and descriptive statistics were calculated.
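For readers unfamiliar with the prompting approach, TextGrad optimizes a natural-language prompt by back-propagating LLM-generated textual feedback, analogous to gradient descent. The Python sketch below illustrates the general pattern only; the engine name, prompt wording, and training pair are placeholders and do not reflect the study's exact configuration (which used OpenAI o1 with four resident summaries as training data).

```python
# Illustrative TextGrad-style prompt optimization sketch (not the study's exact pipeline).
# Assumes the `textgrad` package and an API key; engine names and prompts are placeholders.
import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)  # engine that generates textual feedback

# The summarization instruction is the trainable "parameter".
system_prompt = tg.Variable(
    "Summarize this oncology consult note, covering histology, stage, "
    "pending workup, referrals, treatment plan, and follow-up.",
    requires_grad=True,
    role_description="system prompt guiding consult-note summarization",
)
model = tg.BlackboxLLM("gpt-4o", system_prompt=system_prompt)  # forward (summarizing) engine
optimizer = tg.TGD(parameters=[system_prompt])

# Hypothetical training pair: a consult note plus a resident-authored reference summary.
note = tg.Variable("<de-identified consult note text>", requires_grad=False,
                   role_description="oncology consult note to summarize")
loss_fn = tg.TextLoss(
    "Compare the generated summary against this resident reference summary: "
    "<resident summary text>. Critique completeness, correctness, and patient safety."
)

summary = model(note)       # forward pass: draft summary under the current prompt
loss = loss_fn(summary)     # natural-language critique serves as the loss
loss.backward()             # propagate textual feedback to the prompt variable
optimizer.step()            # rewrite the prompt based on the feedback
print(system_prompt.value)  # inspect the updated instruction
```

In practice this loop would be repeated over the training summaries until the critique stabilizes; the prompt text and loss instruction above are hypothetical stand-ins for the consensus framework described in the abstract.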
Results: Eight unique patient consult notes were summarized by both the LLM and a resident physician. These summaries were each rated by 2-3 raters, for a total of 36 ratings. Mean ratings (±SEM) for AI versus human summaries showed comparable performance across all measures: completeness (4.44±0.21 vs 4.17±0.22), curation (4.22±0.23 vs 4.44±0.11), correctness (4.33±0.21 vs 4.28±0.22), patient safety (4.28±0.19 vs 4.44±0.12), and usefulness (4.33±0.17 vs 4.39±0.15). Interrater reliability was moderate to excellent, with the highest agreement for human summary curation (ICC(2)=0.800, 95% CI [0.100, 1.000]) and moderate agreement for completeness (ICC(2)=0.500) in both summary types. Error analysis revealed similar error rates between AI summaries (2 knowledge-gap errors and 1 faulty-logic error, affecting 2 patients) and human summaries (all knowledge-gap errors, affecting 4 patients).
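The reliability metric reported above, ICC(2), is the two-way random-effects intraclass correlation of Shrout and Fleiss. A minimal sketch of its computation with the pingouin package is shown below; the ratings are made-up placeholders standing in for the study data.

```python
# Minimal sketch of a two-way random-effects ICC (ICC(2)) calculation with pingouin.
# The Likert scores below are hypothetical, not the study's ratings.
import pandas as pd
import pingouin as pg

# Long-format table: one row per rater's score for each summary.
ratings = pd.DataFrame({
    "summary": ["S1"] * 3 + ["S2"] * 3 + ["S3"] * 3 + ["S4"] * 3,
    "rater":   ["R1", "R2", "R3"] * 4,
    "score":   [4, 5, 4, 5, 4, 4, 3, 4, 4, 5, 5, 4],  # placeholder 5-point Likert ratings
})

icc = pg.intraclass_corr(data=ratings, targets="summary",
                         raters="rater", ratings="score")
# The "ICC2" row is the two-way random-effects, single-rater ICC(2,1) with its 95% CI.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```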
Conclusion: This pilot demonstrates that LLM-generated summaries achieved quality comparable to resident-authored summaries across all measured domains, with both groups maintaining excellent performance (mean >4/5), similar error rates, and moderate to excellent agreement among raters. These promising findings will inform the design of a larger study adequately powered to determine the non-inferiority of a prompt-optimized LLM in a real-world clinical summarization task.