3617 - Large Language Model Extraction of Radiation Treatment Variables from Clinical Documentation
Presenter(s)
M. Fis Loperena1, R. Kouzy2, M. B. El Alam2, E. Cha2, A. Grippin2, Z. El Kouzi3, W. A. Woodward4, Q. N. Nguyen2, K. E. Hoffman2, S. J. Shah3, L. Colbert3, A. H. Klopp3, and O. Mohamad2; 1The George Washington University School of Medicine and Health Sciences, Washington, DC, 2Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, 3Division of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, 4Department of Breast Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX
Purpose/Objective(s): Ensuring comprehensive documentation of radiation therapy (RT) details has become increasingly critical, particularly with the rising complexity of re-irradiation (re-RT) cases. Our previous work revealed significant gaps in end-of-treatment summaries and survivorship care plans (SCPs), highlighting the need for more reliable documentation systems. In this study, we developed a Large Language Model (LLM) workflow to extract RT-related information from clinical notes with minimal to no added clinical overhead.
Materials/Methods: We selected 10 representative cases of patients receiving RT (5 breast, 5 genitourinary [GU]) from publicly available Association of Residents in Radiation Oncology (ARRO) teaching cases and generated 45 synthetic weekly on-treatment notes (OTNs) from these cases. We then evaluated each OTN for 15 key variables relevant to radiation therapy documentation standards. We used Google's Gemini-2.0-flash-lite-preview-02-05 LLM for automated data extraction, applying prompt engineering with a data-interchange format schema specification to standardize variable capture. The LLM processed each OTN in chronological order and cross-referenced temporally related documents to aggregate toxicity assessments throughout the treatment course. A radiation oncology resident verified the accuracy of the extracted data using binary classification (correct vs. incorrect). We calculated precision, recall, and F1 scores for each variable and then macro-averaged these metrics across all 15 variables. We performed all analyses in Python within a Google Colab environment.
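As a rough illustration of the extraction step (a sketch under stated assumptions, not the authors' released code), the schema-constrained prompting and chronological aggregation described above might look like the following, assuming the google-generativeai Python SDK; the schema fields, prompt wording, and helper names are illustrative:

    import json
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

    # Illustrative subset of the 15 documentation variables.
    SCHEMA = {
        "rt_intent": "string",          # e.g., curative vs. palliative
        "dose_gy": "number",
        "fractions": "integer",
        "modality": "string",
        "re_irradiation": "boolean",
        "toxicities": "array of strings",
    }

    PROMPT = (
        "Extract the following radiation therapy variables from the "
        "on-treatment note below. Return valid JSON matching this schema: "
        "{schema}\n\nNote:\n{note}"
    )

    model = genai.GenerativeModel("gemini-2.0-flash-lite-preview-02-05")

    def extract_variables(note_text: str) -> dict:
        """Run one on-treatment note through the LLM and parse the JSON reply."""
        response = model.generate_content(
            PROMPT.format(schema=json.dumps(SCHEMA), note=note_text),
            generation_config=genai.GenerationConfig(
                response_mime_type="application/json",  # constrain output to JSON
                temperature=0.0,                        # deterministic extraction
            ),
        )
        return json.loads(response.text)

    def process_course(notes_by_date: list[tuple[str, str]]) -> list[dict]:
        """Process (date, text) OTNs chronologically, carrying toxicities forward."""
        cumulative_tox: set[str] = set()
        records = []
        for _, text in sorted(notes_by_date):
            record = extract_variables(text)
            cumulative_tox |= set(record.get("toxicities") or [])
            record["cumulative_toxicities"] = sorted(cumulative_tox)
            records.append(record)
        return records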
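The per-variable scoring can likewise be reproduced with standard Python; the count structure and helper names below are illustrative assumptions:

    from statistics import mean

    def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
        """Precision, recall, and F1 from one variable's error counts."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    def macro_scores(counts: dict[str, tuple[int, int, int]]) -> dict[str, float]:
        """Macro-average P/R/F1; counts maps variable name -> (TP, FP, FN)."""
        per_variable = [prf1(*c) for c in counts.values()]
        return {
            "precision": mean(p for p, _, _ in per_variable),
            "recall": mean(r for _, r, _ in per_variable),
            "f1": mean(f for _, _, f in per_variable),
        }

Macro-averaging weights each of the 15 variables equally, so a difficult variable (here, special RT techniques) pulls the average down regardless of how often it appears in the notes.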
Results: The LLM-based approach demonstrated strong overall performance across the 15 key variables, achieving a macro-averaged F1 score of 0.96. For treatment planning documentation, the model achieved perfect F1 scores (F1=1.00) for RT intent, dose parameters, fractionation schedules, treatment modality, re-RT status, treatment start dates, and treatment course interruptions. For treatment response and additional clinical factors, including toxicity assessments and concurrent therapy details, extraction was likewise perfect (F1=1.00). Performance remained strong for anatomic locations and treatment end dates (F1=0.90), whereas extraction of special RT techniques proved more challenging (F1=0.60). Overall, macro-averaged precision, recall, and F1 scores were each 0.96.
Conclusion: This proof-of-concept study demonstrates the feasibility of an LLM-based workflow for extracting critical RT variables from standard OTNs without additional clinical effort. We plan to explore additional techniques to improve performance on challenging variables and to test the workflow in real-world clinical validation studies.