3645 - Efficient CTCAE Grading for Post-Radiotherapy Toxicities Using Large Language Models: A Privacy-Preserving Approach Using Instruction Fine-Tuning
Presenter(s)
R. Khanmohammadi1, A. I. Ghanem2,3, A. R. Bhatnagar4, J. Turfa5, S. Siddiqui6, M. A. Elshaikh5, H. Bagher-Ebadian5, B. Movsas5, I. J. Chetty7, M. M. Ghassemi1, and K. Thind5; 1Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 2Department of Radiation Oncology, Henry Ford Hospital, Detroit, MI, 3Clinical Oncology Department, Faculty of Medicine, Alexandria University, Alexandria, Egypt, 4Department of Radiation Oncology, Henry Ford Cancer Institute, Detroit, MI, 5Department of Radiation Oncology, Henry Ford Health, Detroit, MI, 6Radiation Oncology, Henry Ford Health, Detroit, MI, 7Department of Radiation Oncology, Cedars-Sinai Medical Center, Los Angeles, CA
Purpose/Objective(s): Accurate Common Terminology Criteria for Adverse Events (CTCAE) grading is vital for patient care and clinical decision modeling toward the goal of precision medicine. This study introduces a novel, parameter-efficient, and privacy-preserving method for automated CTCAE grading by leveraging instruction fine-tuning (IFT) of compact language models, aiming to improve grading accuracy while minimizing computational demands.
Materials/Methods: We fine-tuned two language models, Llama-3.1-8B (Llama) and Qwen2.5-7B (Qwen), using explicit CTCAE grading guidelines. Low-Rank Adaptation (LoRA, rank 128, α = 32) was applied to the attention, feed-forward, and embedding layers, improving the models’ understanding of clinical terminologies and refining their focus on relevant contexts. Chain-of-thought (CoT) prompting further enhanced reasoning during grading. Our models were trained on 333 expert-labeled clinical notes from 45 prostate cancer patients treated with 78 Gy radiation (2017–2021), covering 12 toxicity symptoms: cystitis, dysuria, erectile dysfunction, hematuria, incontinence, nocturia, proctitis, rectal bleeding, stricture, urgency, urinary frequency, and urinary retention. Two expert clinicians graded notes into Grade (G) 1–3 (Cohen’s κ = 0.88; 92% agreement). A stratified five-fold cross-validation was performed with a 50-10-40 train-validation-test split, yielding approximately 166, 33, and 134 notes per fold, while preserving toxicity severity distribution. Metrics included class-specific F1, macro-averaged precision, recall, area under the receiver operating characteristic curve (AUCROC), and area under the precision-recall curve (AUCPR).
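The 50-10-40 stratified split described above can be illustrated with a minimal sketch. The abstract does not state how the five folds were formed or how many notes fell into each grade, so two assumptions are made here: each fold is an independent stratified resplit with a different random seed, and the per-grade counts (180/110/43) are hypothetical values chosen only so that they sum to 333. The function name `stratified_split` is likewise illustrative, not the authors' code.

```python
import random
from collections import defaultdict

def stratified_split(labels, percents=(50, 10, 40), seed=0):
    """Split note indices into train/val/test, stratified by toxicity grade."""
    rng = random.Random(seed)
    by_grade = defaultdict(list)
    for idx, grade in enumerate(labels):
        by_grade[grade].append(idx)

    train, val, test = [], [], []
    for grade in sorted(by_grade):
        idxs = by_grade[grade]
        rng.shuffle(idxs)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_train = len(idxs) * percents[0] // 100
        n_val = len(idxs) * percents[1] // 100
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical grade counts (assumption): 180 G1, 110 G2, 43 G3 = 333 notes.
labels = [1] * 180 + [2] * 110 + [3] * 43

# Assumption: one independent stratified resplit per fold.
folds = [stratified_split(labels, seed=s) for s in range(5)]
for k, (train, val, test) in enumerate(folds):
    print(f"fold {k}: train={len(train)} val={len(val)} test={len(test)}")
```

With these assumed counts, every fold contains 166 training, 33 validation, and 134 test notes, matching the approximate sizes quoted in the abstract, and each grade keeps roughly the same share in all three subsets.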
Results: Both models improved post-IFT across metrics (Table 1). Llama-3.1-8B’s median F1 scores rose from 48% to 53% (Grade 1), 68% to 71% (Grade 2), and 56% to 71% (Grade 3); precision increased from 49% to 66%, recall from 43% to 72%. Qwen2.5-7B’s median F1 scores improved from 47% to 52% (Grade 1), 53% to 69% (Grade 2), and 56% to 66% (Grade 3); precision rose from 45% to 62%, recall from 42% to 67%.
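As a quick arithmetic check on the medians reported above, the following sketch computes the change in each median score, in percentage points. The numbers are copied from the Results text; note that a difference of medians is not the same as a median of per-fold differences, which the abstract does not report.

```python
# Median scores (%) before and after IFT, copied from the Results text.
medians = {
    "Llama": {"G1 F1": (48, 53), "G2 F1": (68, 71), "G3 F1": (56, 71),
              "Precision": (49, 66), "Recall": (43, 72)},
    "Qwen": {"G1 F1": (47, 52), "G2 F1": (53, 69), "G3 F1": (56, 66),
             "Precision": (45, 62), "Recall": (42, 67)},
}

# Percentage-point change in the median for each model and metric.
gains = {
    model: {metric: after - before for metric, (before, after) in scores.items()}
    for model, scores in medians.items()
}

for model, g in gains.items():
    best = max(g, key=g.get)
    print(f"{model}: {g} -> largest gain: {best} (+{g[best]} points)")
```

For both models the largest change is in recall (+29 points for Llama, +25 for Qwen), while the smallest is Llama's Grade 2 F1 (+3 points).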
Conclusion: This framework, using IFT, LoRA, and CoT, improves toxicity grading accuracy and consistency. It offers a privacy-preserving, scalable solution for better clinical decisions and patient care in radiation oncology.
Abstract 3645 - Table 1: Performance metrics showing improved CTCAE grading after instruction fine-tuning (IFT). Cells give the median with the interquartile range (IQR) in parentheses, in %.
Metric | Llama Initial | Llama IFT | Qwen Initial | Qwen IFT
G1 F1 | 48 (48-53) | 53 (48-57) | 47 (43-48) | 52 (51-54)
G2 F1 | 68 (63-70) | 71 (70-71) | 53 (51-53) | 69 (68-73)
G3 F1 | 56 (45-57) | 71 (68-75) | 56 (54-56) | 66 (66-73)
Precision | 49 (44-51) | 66 (61-67) | 45 (41-47) | 62 (48-67)
Recall | 43 (39-44) | 72 (63-73) | 42 (34-43) | 67 (52-74)
AUCROC | 70 (65-70) | 78 (70-79) | 67 (66-68) | 77 (74-78)
AUCPR | 51 (44-52) | 57 (51-59) | 47 (47-50) | 55 (53-59)
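The class-specific F1 and macro-averaged precision and recall reported in Table 1 follow standard definitions, sketched below in plain Python. The grade labels in the example are toy values invented for illustration, not study data, and the helper names `per_class_prf` and `macro_average` are hypothetical.

```python
def per_class_prf(y_true, y_pred, classes=(1, 2, 3)):
    """Return {grade: (precision, recall, f1)} from true and predicted grades."""
    stats = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        stats[c] = (precision, recall, f1)
    return stats

def macro_average(stats):
    """Unweighted mean of precision, recall, and F1 across grades."""
    n = len(stats)
    return tuple(sum(s[i] for s in stats.values()) / n for i in range(3))

# Toy example (assumption): six notes with true and predicted grades.
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 3, 3]

stats = per_class_prf(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_average(stats)
for grade, (p, r, f1) in stats.items():
    print(f"G{grade}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
print(f"macro precision={macro_p:.2f} macro recall={macro_r:.2f}")
```

Macro-averaging weights each grade equally regardless of how many notes it contains, so performance on a rare grade affects the macro score as much as performance on a common one.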