3609 - Comparative Performance of AI Models in Determining Radiation Therapy Plan for Breast Cancer Patients after Surgery
Presenter(s)
C. Dvorak, P. Kelly, and T. Dvorak; Department of Radiation Oncology, Orlando Health Cancer Institute, Orlando, FL
Purpose/Objective(s): Patients increasingly use artificial intelligence (AI) models to interpret their cancer information and seek treatment advice. This study evaluates the accuracy of AI-generated radiation therapy (RT) recommendations against National Comprehensive Cancer Network (NCCN) guideline-based clinical assessments.
Materials/Methods: In this IRB-approved study, 100 consecutive patients with breast cancer treated with curative intent, seen in a radiation oncology clinic post-surgery in 2023, were analyzed. De-identified Oncologic Histories were input into OpenAI GPT-o1, Google Gemini 2.0 Flash, Claude 3.5 Sonnet, and xAI Grok3. NCCN guidelines (v1.2025) were used to classify guideline-concordant care as Radiation Required, Radiation Optional, or Radiation Not Recommended. No assumptions were made about endocrine therapy unless patients had already started it; as a result, patients aged 70+ with pT1N0 tumors were classified as Radiation Required rather than Optional. Each model generated an RT plan, coded as Radiation Required = 1, Optional = 0, Not Recommended = -1. Outputs were compared to our NCCN-based assessment (Required = 1, Optional = 0, Not Recommended = -1). Concordance was defined as an exact match between recommendations (1 to 1, 0 to 0, -1 to -1).
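As an illustrative sketch only (not part of the study's described methods), the three-level coding and exact-match concordance definition above could be expressed as follows; all function and variable names are hypothetical:

```python
# Minimal sketch of the concordance scoring described above.
# Coding: Radiation Required = 1, Optional = 0, Not Recommended = -1.
# All names here are illustrative; the study's actual tooling is not described.

def concordance_rate(nccn_labels, model_labels):
    """Fraction of cases where the model's code exactly matches the NCCN-based code."""
    assert len(nccn_labels) == len(model_labels)
    matches = sum(1 for n, m in zip(nccn_labels, model_labels) if n == m)
    return matches / len(nccn_labels)

# Hypothetical example with 5 cases: exact match on 4 of 5.
nccn  = [1, 1, 0, -1, 0]
model = [1, 1, 1, -1, 0]
print(f"Concordance: {concordance_rate(nccn, model):.0%}")  # Concordance: 80%
```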
Results: Per clinical NCCN assessment, RT was required in 59 patients (59%), optional in 33 (33%), and not recommended in 8 (8%). AI model recommendations varied: Gemini showed 62% concordance (100% sensitivity, 38% specificity); GPT, 63% (92% sensitivity, 63% specificity); Claude, 67% (98% sensitivity, 63% specificity); and Grok, 72% (97% sensitivity, 75% specificity). RT was recommended for optional cases by Gemini in 100% of cases, GPT in 82%, Claude in 82%, and Grok in 58%. Most ‘straightforward’ cases were classified correctly, but the models sometimes struggled with nuanced scenarios such as cN+ to ypN0 cases. On individual record review, the models identified and incorporated prior radiation history, internal mammary involvement, changes in receptor status from original biopsy to final pathology, BRCA mutations, and pacemaker presence into their recommendations.
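The abstract does not specify how the three-level recommendations were binarized for sensitivity and specificity; one plausible reading, treating the clinical ‘Radiation Required’ class as positive and any other model output as negative, is sketched below (all names hypothetical, not the study's actual analysis code):

```python
# Assumed binarization (not stated in the abstract): positive = clinical
# "Radiation Required" (code 1); a model call is positive only when it is 1.

def sensitivity_specificity(nccn_labels, model_labels):
    tp = sum(1 for n, m in zip(nccn_labels, model_labels) if n == 1 and m == 1)
    fn = sum(1 for n, m in zip(nccn_labels, model_labels) if n == 1 and m != 1)
    tn = sum(1 for n, m in zip(nccn_labels, model_labels) if n != 1 and m != 1)
    fp = sum(1 for n, m in zip(nccn_labels, model_labels) if n != 1 and m == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

# Hypothetical example with 6 cases.
nccn  = [1, 1, 1, 0, 0, -1]
model = [1, 1, 0, 1, 0, -1]
sens, spec = sensitivity_specificity(nccn, model)
print(f"Sensitivity: {sens:.0%}, Specificity: {spec:.0%}")  # 67%, 67%
```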
Conclusion: Recommendation patterns varied across the AI models. Grok3 aligned best with NCCN-based guidelines, while Gemini prioritized treating with RT over not treating. The models recommended radiation in 92-100% of patients who needed it. Optional cases were frequently misclassified as RT Required, suggesting a bias toward active treatment, possibly due to sensitivity to underlying risk factors in the Oncologic History. As patients use AI models more frequently for second opinions or personal decision-making, physicians may need to be aware of these model biases to guide shared decision-making more effectively.