Main Session
Sep 30
PQA 09 - Hematologic Malignancies, Health Services Research, Digital Health Innovation and Informatics

3720 - Large Language Models in Radiation Oncology Physics Education: Performance Assessment Using the 2024 RAPHEX Exam

04:00pm - 05:00pm PT
Hall F
Screen: 11
POSTER

Presenter(s)

Aaron Segura, MS, BS - University of New Mexico Health Sciences Center, Albuquerque, NM

A. C. Segura1,2, D. Perkins1,2, V. A. Dumane3, T. Liu3, and J. Runnels3; 1Department of Biomedical Engineering, University of New Mexico School of Engineering, Albuquerque, NM, 2Department of Internal Medicine, Division of Radiation Oncology, University of New Mexico School of Medicine, Albuquerque, NM, 3Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, NY

Purpose/Objective(s): Large language models (LLMs) are increasingly used in medical education, yet their effectiveness in radiation oncology physics remains underexamined. As of 2025, multiple LLMs with varying capabilities are widely available. This study systematically benchmarks six leading models (GPT-4o, GPT-4-turbo, GPT-3.5-turbo, Gemini 1.5 Pro, Claude 3.5 Sonnet, and DeepSeek-R1) on the 2024 RAPHEX Therapy Examination to assess their suitability for medical physics education. We evaluate accuracy, mathematical reasoning, explanation similarity, and response consistency to determine their reliability as learning tools.

Materials/Methods: The 2024 RAPHEX Therapy Examination, comprising 140 multiple-choice questions (MCQs) on radiation interactions, dosimetry, treatment planning, and quality assurance, was used with permission. Six LLMs answered all questions under a standardized prompt requiring precise, mathematically rigorous, evidence-based responses; none of the models had prior exposure to the exam. Each model completed five trials at a temperature of 0.2, with answers graded as correct or incorrect against the RAPHEX key. Response consistency was measured as answer variability across the five trials, and explanation similarity was measured as cosine similarity between Sentence-BERT embeddings of the models' explanations. Statistical analysis used one-way ANOVA with Bonferroni correction to compare accuracy, explanation depth, and consistency, with Tukey HSD post-hoc tests for pairwise model comparisons.
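
A minimal Python sketch of this scoring pipeline is shown below. The function names, the per-trial answer layout, and the Sentence-BERT checkpoint ("all-MiniLM-L6-v2") are illustrative assumptions; the abstract does not specify an implementation.

```python
# Minimal sketch of the grading, consistency, and similarity measures described
# above. All names and the embedding checkpoint are illustrative assumptions.
from collections import Counter
from sentence_transformers import SentenceTransformer, util

def accuracy(answers: list[str], key: list[str]) -> float:
    """Fraction of the 140 MCQs answered correctly against the RAPHEX key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def consistency(trial_answers: list[list[str]]) -> float:
    """Mean per-question agreement with the modal answer across the five trials."""
    rates = []
    for choices in zip(*trial_answers):  # regroup trials -> per-question choices
        modal_count = Counter(choices).most_common(1)[0][1]
        rates.append(modal_count / len(choices))
    return sum(rates) / len(rates)

# Explanation similarity: cosine similarity between Sentence-BERT embeddings.
_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def explanation_similarity(text_a: str, text_b: str) -> float:
    emb = _model.encode([text_a, text_b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```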

Results: LLM performance on the 2024 RAPHEX Therapy Examination varied widely. Claude 3.5 Sonnet led with 120/140 correct (86%), followed by GPT-4o (112, 80%), GPT-4-turbo (108, 77%), Gemini 1.5 Pro (104, 74%), GPT-3.5-turbo (66, 47%), and DeepSeek-R1 (66, 47%). A one-way ANOVA found significant accuracy differences among models (F = 20.28, p = 3.90e-19). Differences were strongest for non-math questions (F = 18.79, p = 1.54e-17) and radiation interactions (F = 15.03, p = 7.04e-14). Math-related questions showed weaker but still significant variation (F = 2.72, p = 0.022), as did dosimetry (F = 4.83, p = 0.0004). No significant variation was found for quality assurance (F = 0.67, p = 0.66) or treatment planning (F = 1.74, p = 0.13). Tukey HSD analysis showed Claude 3.5 Sonnet and GPT-4o performed best, with no significant difference between them (p = 0.12); GPT-3.5-turbo and DeepSeek-R1 had significantly lower accuracy than the other models (p ≤ 1.53e-12).
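
As a reproducibility aid, the statistical comparison above can be sketched with SciPy and statsmodels as follows; the per-trial accuracy values are illustrative placeholders rather than study data, and the library choice is an assumption on our part.

```python
# Minimal sketch of the one-way ANOVA and Tukey HSD comparison reported above.
# Per-trial accuracies below are illustrative placeholders, not study results.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = {
    "claude-3.5-sonnet": [0.86, 0.85, 0.87, 0.86, 0.85],
    "gpt-4o":            [0.80, 0.79, 0.81, 0.80, 0.80],
    "gpt-3.5-turbo":     [0.47, 0.48, 0.46, 0.47, 0.47],
}

# One-way ANOVA across models (per-trial accuracy as the observations).
f_stat, p_val = f_oneway(*scores.values())
print(f"F = {f_stat:.2f}, p = {p_val:.2e}")

# Tukey HSD post-hoc test for pairwise model differences.
values = np.concatenate(list(scores.values()))
labels = np.concatenate([[name] * len(v) for name, v in scores.items()])
print(pairwise_tukeyhsd(values, labels))
```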

Conclusion: These findings highlight LLMs’ strengths and limitations in radiation oncology physics, informing their potential roles in board exam preparation, resident training, and AI-assisted decision-making. As ASTRO 2025 explores AI’s role in radiation oncology, this study underscores the promise of LLMs while emphasizing the need for improved quantitative reasoning, reliability, and interpretability to enhance their educational and clinical utility.