2483 - Enhancing Adverse Event Monitoring in Prostate Cancer Care with Large Language Models
Presenter(s)
M. O. Shotande1, J. Delikat1, R. M. Fernando1, G. Rasool1, T. Welniak1, J. Andreozzi2, P. A. S. Johnstone3, E. Katsoulakis4, A. P. Dicker5, H. Jim1, and I. El Naqa6; 1Moffitt Cancer Center, Tampa, FL, 2H. Lee Moffitt Cancer Center and Research Institute, Department of Radiation Oncology, Tampa, FL, 3Department of Radiation Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 4Veterans Affairs, Tampa, FL, 5Department of Radiation Oncology, Sidney Kimmel Medical College at Thomas Jefferson University, Philadelphia, PA, 6Machine Learning & Radiation Oncology, Moffitt Cancer Center, Tampa, FL
Purpose/Objective(s):
Grading radiation-induced toxicities to evaluate the severity of side effects is typically done retrospectively and manually against clinical standards such as the Common Terminology Criteria for Adverse Events (CTCAE). This laborious process consumes valuable clinician time and is error-prone and susceptible to under-reporting, which can lead to uninformative outcome models and, in turn, poor decision-making. Efficient and accurate processing of clinical notes to grade such toxicities is therefore crucial. Natural language processing provides powerful tools, such as large language models (LLMs), that automate complex text analysis. Applying LLMs to clinical notes for toxicity evaluation can save clinicians time, improve the consistency of toxicity grading, and ultimately improve patient outcomes. We evaluated this approach in the context of prostate cancer.
Materials/Methods:
Sixty unique clinic notes were retrospectively collected from 47 prostate cancer patients seen in a genitourinary (GU) cancer clinic. Patients were men with locally advanced or metastatic prostate cancer who had been treated with radiation and had completed at least one patient-reported outcome assessment; ages ranged from 50 to 89 years. The LLaMA 3.1 LLM (8 billion parameters), prompt engineering, and retrieval-augmented generation (RAG) were deployed to create a robust, automated, and updatable CTCAE grading pipeline. Specifically, CTCAE v4.03 was used as the grading standard, and the spreadsheet from the NIH Cancer Therapy Evaluation Program webpage was supplied to the RAG system to retrieve relevant context for the LLM prompts. For each clinic note, the ground-truth CTCAE grades assigned by a clinician were removed and used to evaluate the LLM-extracted grades.
Results:
Overall performance was high: accuracy and the weighted sensitivity, precision, and F1 scores all exceeded 71% (Table 1). However, CTCAE grade 0 was overrepresented; unweighted average scores were at least 40%. The model generally over-predicted grades 0 and 1, under-predicted grades above 1, and most errors were off by one grade. Performance often improved when RAG was used to dynamically provide relevant context for each patient's clinical note, compared with no RAG; however, unweighted sensitivity was higher without RAG, and weighted precision was about the same. The Cohen's kappa score comparing clinician-provided grades with LLM-provided grades increased from 0.37 to 0.50 when using RAG.
Conclusion:
We effectively automated the evaluation of toxicity grades from clinical notes using a publicly available, free LLM. Using RAG, we created a pipeline that can be updated as medical criteria and standards evolve. The higher Cohen's kappa score obtained with RAG indicates that LLM/RAG-based systems can achieve closer alignment with how clinicians determine toxicity grades.

Table 1
| Metric | No RAG | RAG |
|---|---|---|
| Accuracy | .72 | .83 |
| Sensitivity* | .51 | .49 |
| Sensitivity** | .72 | .83 |
| Precision* | .40 | .42 |
| Precision** | .86 | .86 |
| F1* | .42 | .45 |
| F1** | .76 | .84 |
| Cohen's kappa | .37 | .50 |

*Unweighted average; **Weighted average.
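As a brief illustration of the agreement statistic reported in Table 1, Cohen's kappa corrects the raw agreement between two raters (here, clinician vs. LLM CTCAE grades) for the agreement expected by chance from each rater's grade frequencies. A minimal sketch in plain Python follows; the grade lists are toy values for illustration only, not the study's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items given identical grades.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal grade frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    pe = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical paired CTCAE grades (clinician vs. LLM) for eight note excerpts.
clinician = [0, 0, 1, 2, 0, 1, 3, 0]
llm       = [0, 0, 1, 1, 0, 1, 2, 0]
print(round(cohens_kappa(clinician, llm), 2))  # → 0.61
```

Because kappa discounts chance agreement, it is a stricter measure than raw accuracy when one grade (here, grade 0) dominates the label distribution, which is why the abstract reports it alongside accuracy.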