3664 - Multimodal Large Model for Automated Delineation of Primary Gross Tumor Volume in Nasopharyngeal Carcinoma Radiotherapy: A Pilot Study
Presenter(s)
L. Lin1, Y. Liu2, L. Jia2, X. Zeng3, H. Li2, Y. Li3, and Y. Sun1; 1State Key Laboratory of Oncology in South China, Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangdong Provincial Clinical Research Center for Cancer, Department of Radiation Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong, China, 2Shenzhen United Imaging Research Institute of Innovative Medical Equipment, Shenzhen, China, 3State Key Laboratory of Oncology in South China, Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangdong Provincial Clinical Research Center for Cancer, Sun Yat-sen University Cancer Center, Guangzhou, China
Purpose/Objective(s):
This study aimed to establish the feasibility and preliminary validity of a multimodal deep learning framework integrating clinical text and imaging data for automated delineation of the primary gross tumor volume (GTVp) in nasopharyngeal carcinoma (NPC) radiotherapy. The primary objective was to explore whether structured clinical parameters could enhance segmentation accuracy and context-aware adaptability, thereby improving the efficiency and interpretability of segmentation.
Materials/Methods:
A retrospective cohort of 773 NPC patients was included in this study. Four magnetic resonance (MR) imaging sequences (T1, T1c, T1FSC, and T2), manually contoured GTVp, and structured clinical text (tumor stage, chemotherapy status) were acquired. Data were partitioned into training (618, 79.9%), validation (77, 10.0%), and testing (78, 10.1%) sets. MR images underwent intensity normalization and resampling to 1×1×3 mm³. A dual-branch framework was designed: (a) a 3D U-Net processed the MR volumes, and (b) Llama-7B extracted embeddings from the text inputs. A cross-attention fusion module aligned text and image features at multiple encoder layers. Segmentation accuracy was quantified via the Dice similarity coefficient (DSC). Additionally, text-guided adaptability was assessed by varying the input text while holding the imaging constant.
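A minimal sketch of the kind of cross-attention fusion described above is given below. The layer sizes, text-embedding dimension, residual scheme, and all module names (e.g., CrossAttentionFusion) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: cross-attention fusion in the style described above,
# with image features as queries and text embeddings as keys/values.
# Dimensions and names are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, img_channels: int, text_dim: int = 4096, num_heads: int = 8):
        super().__init__()
        # Project text embeddings (e.g., from a frozen Llama-7B) into the image
        # feature space so the attention dimensions match.
        self.text_proj = nn.Linear(text_dim, img_channels)
        self.attn = nn.MultiheadAttention(img_channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(img_channels)

    def forward(self, img_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, D, H, W) feature map from one 3D U-Net encoder stage
        # text_emb: (B, T, text_dim) token embeddings of the structured clinical text
        b, c, d, h, w = img_feat.shape
        q = img_feat.flatten(2).transpose(1, 2)          # (B, D*H*W, C) image queries
        kv = self.text_proj(text_emb)                    # (B, T, C) text keys/values
        fused, _ = self.attn(query=q, key=kv, value=kv)  # text-conditioned features
        fused = self.norm(q + fused)                     # residual connection + norm
        return fused.transpose(1, 2).reshape(b, c, d, h, w)


if __name__ == "__main__":
    fusion = CrossAttentionFusion(img_channels=64, text_dim=4096)
    img = torch.randn(1, 64, 8, 16, 16)   # downsampled 3D feature map
    txt = torch.randn(1, 12, 4096)        # embeddings of e.g. "T3, post-induction chemo"
    print(fusion(img, txt).shape)         # torch.Size([1, 64, 8, 16, 16])
```

Placing such a block at several encoder resolutions is one way to realize the multi-layer alignment the abstract describes, while keeping the segmentation backbone itself unchanged.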
Results:
The multimodal model achieved a mean DSC of 0.82 ± 0.03 on the testing set, outperforming the image-only baseline model (mean DSC of 0.80). Regarding text-guided adaptability, when tumor stages T2–T4 were input, the predicted tumor volumes increased by 2.00%, 4.14%, and 9.49%, respectively, compared with a T1 input. These increases align with patterns observed in advanced disease stages. Volumes generated with a “chemo” text input were 23.59% larger than those generated with a “non-chemo” input (p<0.05), reflecting the expected post-induction chemotherapy shrinkage in clinical contours. Among the 29 test cases, the highest DSC for the predicted segmentations was achieved when both the correct T stage and chemotherapy information were provided to the model. This suggests that augmenting the input with comprehensive clinical context may improve model performance, particularly in capturing subtle anatomical variations. Lower DSC values (0.70–0.75) occurred in cases with small tumors (T1), suggesting challenges in text-image feature alignment.
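For concreteness, a minimal sketch of how the DSC and the text-swap adaptability check could be computed is shown below; the function names, the percent-volume-change metric, and the toy masks are assumptions for illustration, not the authors' evaluation code.

```python
# Illustrative sketch only: DSC and the text-guided adaptability check
# (vary the text prompt, hold the image constant, compare predicted volumes).
import numpy as np


def dice(pred: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient between two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    return 2.0 * np.logical_and(pred, ref).sum() / (pred.sum() + ref.sum() + eps)


def volume_change(pred_variant: np.ndarray, pred_reference: np.ndarray) -> float:
    """Percent change in predicted volume when only the text prompt is varied."""
    v_ref = pred_reference.sum()
    return 100.0 * (pred_variant.sum() - v_ref) / v_ref


if __name__ == "__main__":
    # Toy masks standing in for model outputs under different text prompts.
    base = np.zeros((16, 16, 16), dtype=bool)
    base[4:10, 4:10, 4:10] = True        # prediction with a "T1" text input
    larger = base.copy()
    larger[4:11, 4:10, 4:10] = True      # prediction with a "T3" text input
    print(f"DSC between the two predictions: {dice(larger, base):.3f}")
    print(f"Volume change vs. T1 input: {volume_change(larger, base):.2f}%")
```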
Conclusion:
This pilot study demonstrates that integrating structured clinical text with multi-sequence MR imaging improves NPC GTVp segmentation accuracy and enables dynamic, context-aware target delineation. The model's ability to adapt its outputs based on tumor stage and treatment history highlights its potential for personalized radiotherapy planning. Future work will focus on optimizing feature fusion for complex cases and validating the framework in prospective multicenter trials.