Main Session
Sep 29
PQA 04 - Gynecological Cancer, Head and Neck Cancer

2773 - Deep Learning for Automatic T-Staging Prediction of Laryngeal Cancer Using CT Scan and Radiology Report

10:45am - 12:00pm PT
Hall F
Screen: 14
POSTER

Presenter(s)

Dakai Jin, PhD - Alibaba Group (US) Inc., New York, NY

X. Zhao1, Y. Su2, J. Wang3, Z. Ji4, Y. Wang4, S. Chen3, Y. Cheng3, L. Lu5, D. Jin5, and N. Shen6; 1Xi'an Jiaotong University, Xi'an, Shaanxi, China, 2Alibaba DAMO Academy, Hangzhou, China, 3Zhongshan Hospital, Shanghai, China, 4Alibaba Group (US) Inc., Washington, DC, 5Alibaba Group (US) Inc., New York, NY, 6Zhongshan Hospital, Fudan University, Shanghai, China

Purpose/Objective(s):

Accurate T-staging of laryngeal cancer is crucial for treatment planning and prognostic analysis. Conventional T-staging relies on radiologists' subjective CT interpretation, which suffers from low accuracy and large inter-observer variability. While deep learning offers potential solutions, existing approaches require manual region of interest (ROI) delineation on CT scans and neglect valuable information in radiology reports. In this study, we develop and validate a multimodal T-staging prediction deep network that integrates CT visual features with textual information from the radiology report to improve diagnostic accuracy.

Materials/Methods:

We collected data from 392 patients with pathologically confirmed laryngeal cancer (stages T1-T4). Each case included a contrast-enhanced CT scan and the corresponding radiology report. A pre-trained head and neck organ segmentation model was applied to automatically locate the laryngeal tumor ROI. This automated process eliminated the need for time-consuming manual delineation and allowed the network to focus on learning tumor characteristics. Subsequently, a dual-stream deep attention network was developed for multimodal T-staging prediction. This network integrated two key features: (1) global CT features, aligned with the textual report semantics, to capture macroscopic anatomical relationships, and (2) localized CT ROI features, extracted from the cropped tumor regions, to characterize local tumor appearance. These complementary global and local representations were fused through an adaptive weighting mechanism to generate the final T-stage prediction. Performance was evaluated on a 20% independent test set (n=79), with patients categorized as early-stage (T1+T2) vs. advanced-stage (T3+T4). Quantitative metrics included F1-score and accuracy (ACC).
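
A minimal sketch of the dual-stream fusion described above, assuming a PyTorch implementation: the encoder backbones, feature dimensions, attention configuration, and the binary early-vs-advanced classification head are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch only: fuses global CT features (aligned with report-text
# embeddings via cross-attention) and local tumor-ROI features through an
# adaptive, learned weighting before T-stage classification.
import torch
import torch.nn as nn


class DualStreamTStager(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, hidden=256, num_classes=2):
        super().__init__()
        # Stream 1: project global CT features and report-text embeddings into a
        # shared space so visual features can be aligned with report semantics.
        self.global_proj = nn.Linear(img_dim, hidden)
        self.text_proj = nn.Linear(txt_dim, hidden)
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

        # Stream 2: features from the automatically cropped tumor ROI.
        self.local_proj = nn.Linear(img_dim, hidden)

        # Adaptive weighting: a small gate decides how much to trust each stream.
        self.gate = nn.Sequential(nn.Linear(hidden * 2, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, global_feat, roi_feat, text_feat):
        # global_feat: (B, N_img, img_dim) slice/patch features from the whole CT
        # roi_feat:    (B, img_dim)        pooled features from the tumor ROI crop
        # text_feat:   (B, N_txt, txt_dim) token embeddings of the radiology report
        g = self.global_proj(global_feat)              # (B, N_img, hidden)
        t = self.text_proj(text_feat)                  # (B, N_txt, hidden)
        # Report tokens attend to CT regions: text-guided global representation.
        attn_out, _ = self.cross_attn(query=t, key=g, value=g)
        global_repr = attn_out.mean(dim=1)             # (B, hidden)

        local_repr = self.local_proj(roi_feat)         # (B, hidden)

        # Adaptive fusion of the complementary global and local representations.
        w = self.gate(torch.cat([global_repr, local_repr], dim=-1))  # (B, 2)
        fused = w[:, 0:1] * global_repr + w[:, 1:2] * local_repr
        return self.classifier(fused)                  # (B, num_classes) logits


if __name__ == "__main__":
    model = DualStreamTStager()
    logits = model(
        torch.randn(2, 64, 512),   # global CT features
        torch.randn(2, 512),       # ROI features
        torch.randn(2, 40, 768),   # report-text embeddings
    )
    print(logits.shape)  # torch.Size([2, 2])
```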

Results:

The performance of T-staging prediction is summarized in Table 1. Our proposed model achieved the best performance, with an F1-score of 78.5% and an accuracy of 79.8%. This represents an improvement of 22.7 percentage points in F1-score over the baseline deep learning model using CT input alone (55.8% F1, 57.0% ACC). Compared with radiologist performance (66.8% F1, 67.1% ACC), the multimodal approach achieved a statistically significant gain of 11.7 percentage points in F1-score, underscoring the benefit of combining CT imaging features with textual report semantics.

Conclusion:

We developed an automated, multimodal T-staging prediction model for laryngeal cancer that uses both the CT scan and the radiology report. Our findings suggest that this model has the potential to improve preoperative T-staging accuracy for laryngeal cancer and to provide additional information for personalized treatment planning.

Abstract 2773 - Table 1: Quantitative evaluation results of T-staging classification

Method             F1      ACC
CT input only      0.5580  0.5700
Multimodal model   0.7850  0.7950
Radiologist        0.6680  0.6710