Main Session
Sep 30
PQA 09 - Hematologic Malignancies, Health Services Research, Digital Health Innovation and Informatics

3679 - Automated Toxicity Extraction from Radiation Therapy Notes Using Large Language Models

04:00pm - 05:00pm PT
Hall F
Screen: 9
POSTER

Presenter(s)

Jordan McDonald, MD - MD Anderson Cancer Center, Houston, TX

R. Kouzy1, J. McDonald1, W. Floyd1, O. Mohamad1, L. Colbert2, A. H. Klopp2, and D. S. Bitterman3; 1Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, 2Division of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, 3Department of Radiation Oncology, Dana-Farber Cancer Institute/Brigham and Women’s Hospital, Harvard Medical School, Boston, MA

Purpose/Objective(s): Extracting toxicity information from clinical notes is crucial for patient care, yet the process remains labor-intensive and subject to variability across clinicians and institutions. In this study, we evaluated multiple large language model (LLM) workflows for extracting Common Terminology Criteria for Adverse Events (CTCAE) toxicity grades, specifically for radiation dermatitis, nausea, and fatigue, from clinical notes of patients undergoing radiation therapy.

Materials/Methods: We assembled a synthetic dataset of 100 on-treatment visit notes from radiation therapy patients: 15 notes manually authored by a resident physician and 85 notes generated using Claude 3.5 Sonnet (version February 2025). The resulting corpus captured varying severity levels of fatigue (grades 0–3), nausea (grades 0–3), and dermatitis (grades 0–4). We evaluated three CTCAE extraction workflows: a retrieval-augmented generation (RAG) system that incorporated CTCAE v5.0 definitions; a prompt-based approach that embedded CTCAE grades alongside representative few-shot examples; and an agentic pipeline in which an initial agent classified toxicity grades and a second LLM resolved ambiguous cases. All evaluations were conducted with the OpenAI GPT-4o model (version February 2025) via API calls. Performance was measured using macro-averaged F1 score, precision, and recall across three tasks: task 1, distinguishing no toxicity from any toxicity; task 2, differentiating mild toxicity (grade 1) from significant toxicity (grades 2–3 for fatigue and nausea, grades 2–4 for dermatitis); and task 3, categorizing cases as no toxicity versus mild toxicity versus significant toxicity.

Results: The RAG system demonstrated the best overall performance, achieving macro-F1 scores of 0.82, 0.89, and 0.80 for the three tasks, respectively. For individual toxicities, the RAG system reached macro-F1 scores of 0.86 for fatigue, 0.85 for nausea, and 0.81 for dermatitis, averaged across all tasks. Performance averaged across tasks yielded precision and recall values of approximately 0.84 and 0.91 for fatigue, 0.88 and 0.85 for nausea, and 0.79 and 0.83 for dermatitis, respectively. Although the agentic and prompt-based few-shot approaches performed comparably to each other, particularly for nausea extraction, neither surpassed the RAG system overall.

Conclusion: Our findings indicate that out-of-the-box LLMs show promise for accurately identifying CTCAE toxicities in the radiation therapy setting. These results underscore the potential of LLM-based workflows to streamline and improve acute toxicity abstraction in clinical practice.