Main Session

Sep 30

PQA 09 - Hematologic Malignancies, Health Services Research, Digital Health Innovation and Informatics

3645 - Efficient CTCAE Grading for Post-Radiotherapy Toxicities Using Large Language Models: A Privacy-Preserving Approach Using Instruction Fine-Tuning

04:00pm - 05:00pm PT

Hall F

Screen: 6

POSTER

Presenter(s)

Reza Khanmohammadi, MS - Michigan State University, East Lansing, MI

R. Khanmohammadi¹, A. I. Ghanem²,³, A. R. Bhatnagar⁴, J. Turfa⁵, S. Siddiqui⁶, M. A. Elshaikh⁵, H. Bagher-Ebadian⁵, B. Movsas⁵, I. J. Chetty⁷, M. M. Ghassemi¹, and K. Thind⁵; ¹Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, ²Department of Radiation Oncology, Henry Ford Hospital, Detroit, MI, ³Clinical Oncology Department, Faculty of Medicine, Alexandria University, Alexandria, Egypt, ⁴Department of Radiation Oncology, Henry Ford Cancer Institute, Detroit, MI, ⁵Department of Radiation Oncology, Henry Ford Health, Detroit, MI, ⁶Radiation Oncology, Henry Ford Health, Detroit, MI, ⁷Department of Radiation Oncology, Cedars-Sinai Medical Center, Los Angeles, CA

Purpose/Objective(s): Accurate Common Terminology Criteria for Adverse Events (CTCAE) grading is vital for patient care and clinical decision modeling toward the goal of precision medicine. This study introduces a novel, parameter-efficient, and privacy-preserving method for automated CTCAE grading by leveraging instruction fine-tuning (IFT) of compact language models, aiming to improve grading accuracy while minimizing computational demands.

Materials/Methods: We fine-tuned two language models, Llama-3.1-8B (Llama) and Qwen2.5-7B (Qwen), using explicit CTCAE grading guidelines. Low-Rank Adaptation (LoRA, rank 128, α = 32) was applied to the attention, feed-forward, and embedding layers, improving the models’ understanding of clinical terminologies and refining their focus on relevant contexts. Chain-of-thought (CoT) prompting further enhanced reasoning during grading. Our models were trained on 333 expert-labeled clinical notes from 45 prostate cancer patients treated with 78 Gy radiation (2017–2021), covering 12 toxicity symptoms: cystitis, dysuria, erectile dysfunction, hematuria, incontinence, nocturia, proctitis, rectal bleeding, stricture, urgency, urinary frequency, and urinary retention. Two expert clinicians graded notes into Grade (G) 1–3 (Cohen’s κ = 0.88; 92% agreement). A stratified five-fold cross-validation was performed with a 50-10-40 train-validation-test split (approximately 166, 33, and 134 notes per fold) while preserving toxicity severity distribution. Metrics included class-specific F1, macro-averaged precision, recall, area under the receiver operating characteristic curve (AUCROC), and area under the precision-recall curve (AUCPR).
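To make the LoRA settings above concrete, the following is a minimal sketch of how rank and α enter a low-rank weight update, written in plain Python. It is not the authors' training code. The function names are hypothetical, and the 4096 x 4096 layer size in the usage note is an illustrative assumption rather than a figure from the abstract. The sketch uses the standard LoRA formulation, in which a frozen weight matrix W is adjusted by (α / r) · B · A, with A of shape r x d_in and B of shape d_out x r.

```python
from typing import List

Matrix = List[List[float]]


def lora_scaling(rank: int, alpha: float) -> float:
    """Scaling factor applied to the low-rank update (alpha / rank)."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has rank * d_in entries, B has d_out * rank entries.
    return rank * (d_in + d_out)


def lora_delta(a: Matrix, b: Matrix, rank: int, alpha: float) -> Matrix:
    """Compute the weight update (alpha / rank) * B @ A.

    a: rank x d_in matrix, b: d_out x rank matrix (hypothetical toy inputs).
    """
    scale = lora_scaling(rank, alpha)
    d_out, d_in = len(b), len(a[0])
    return [
        [scale * sum(b[i][k] * a[k][j] for k in range(rank)) for j in range(d_in)]
        for i in range(d_out)
    ]


if __name__ == "__main__":
    # Settings reported in the abstract: rank 128, alpha 32.
    print(lora_scaling(128, 32))
    # Assumed (illustrative) 4096 x 4096 projection layer.
    print(lora_param_count(4096, 4096, 128))
```

Under these assumptions, rank 128 with α = 32 scales the update by 0.25, and a single 4096 x 4096 layer gains 1,048,576 trainable parameters, compared with 16,777,216 if the full matrix were trained.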

Results: Both models improved post-IFT across metrics (Table 1). Llama-3.1-8B’s median F1 scores rose from 48% to 53% (Grade 1), 68% to 71% (Grade 2), and 56% to 71% (Grade 3); precision increased from 49% to 66%, recall from 43% to 72%. Qwen2.5-7B’s median F1 scores improved from 47% to 52% (Grade 1), 53% to 69% (Grade 2), and 56% to 66% (Grade 3); precision rose from 45% to 62%, recall from 42% to 67%.
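The class-specific F1 and macro-averaged precision and recall reported above can be computed as in the sketch below. It uses plain Python and a small set of made-up labels; the function names and the toy data are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple


def per_class_prf(
    y_true: List[int], y_pred: List[int], labels: List[int]
) -> Dict[int, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for each grade."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[c] = (precision, recall, f1)
    return scores


def macro_average(scores: Dict[int, Tuple[float, float, float]], index: int) -> float:
    """Unweighted mean over classes; index 0 = precision, 1 = recall, 2 = F1."""
    return sum(s[index] for s in scores.values()) / len(scores)


# Hypothetical toy example: true and predicted CTCAE grades for six notes.
toy_true = [1, 1, 2, 2, 3, 3]
toy_pred = [1, 2, 2, 2, 3, 1]
toy_scores = per_class_prf(toy_true, toy_pred, labels=[1, 2, 3])

if __name__ == "__main__":
    for grade, (p, r, f) in toy_scores.items():
        print(f"Grade {grade}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
    print(f"Macro precision: {macro_average(toy_scores, 0):.2f}")
    print(f"Macro recall: {macro_average(toy_scores, 1):.2f}")
```

Macro averaging weights each grade equally regardless of how many notes it contains, which matters when the grade distribution is imbalanced.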

Conclusion: This framework, using IFT, LoRA, and CoT, improves toxicity grading accuracy and consistency. It offers a privacy-preserving, scalable solution for better clinical decisions and patient care in radiation oncology.

Abstract 3645 - Table 1: Performance metrics showing improved CTCAE grading after instruction fine-tuning (IFT)

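| Model | PC Stats | G1 F1 Initial | G1 F1 IFT | G2 F1 Initial | G2 F1 IFT | G3 F1 Initial | G3 F1 IFT | Precision Initial | Precision IFT | Recall Initial | Recall IFT | AUCROC Initial | AUCROC IFT | AUCPR Initial | AUCPR IFT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama | Median | 48 | 53 | 68 | 71 | 56 | 71 | 49 | 66 | 43 | 72 | | | | |
| Llama | IQR | 48-53 | 48-57 | 63-70 | 70-71 | 45-57 | 68-75 | 44-51 | 61-67 | 39-44 | 63-73 | 65-70 | 70-79 | 44-52 | 51-59 |
| Qwen | Median | 47 | 52 | 53 | 69 | 56 | 66 | 45 | 62 | 42 | 67 | | | | |
| Qwen | IQR | 43-48 | 51-54 | 51-53 | 68-73 | 54-56 | 66-73 | 41-47 | 48-67 | 34-43 | 52-74 | 66-68 | 74-78 | 47-50 | 53-59 |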