3728 - Cross Attention Swin Transformer for Multi-Modal Head and Neck Cancer Segmentation in Data Limited Regimes
Presenter(s): S. Tan¹, J. Jiang², S. Elguindi², N. Y. Lee³, and H. Veeraraghavan²; ¹Weill Cornell Graduate School of Medical Sciences, New York, NY; ²Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY; ³Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center, New York, NY
Purpose/Objective(s): Including diagnostic 18F-FDG-PET with the computed tomography (CT) scans acquired for radiation treatment planning (RTP) can increase the delineation accuracy of primary gross tumors (GTVp) and lymph nodes (GTVn) in head and neck (HN) cancers. However, spatial misalignment between the diagnostic PET and the simulation CT can reduce the accuracy of automated methods. Hence, this study employed a multimodal deep learning model, XLinker, that combines 18F-FDG-PET with CT through a cross-attention module.
Materials/Methods: XLinker extracts features from the 18F-FDG-PET and CT scans using identical Swin transformer encoders. Features from the two modalities are combined using cross attention implemented in stage 3 of the transformer, and a UNet convolutional decoder generates the segmentations. Data-efficient training is enabled by pretrained Swin encoders, created through self-supervised learning on 10,412 unlabeled CT datasets from various disease sites sourced from The Cancer Imaging Archive, to extract anatomically useful features. Data-efficient fine-tuning was performed using 10%, 20%, and 50% of the training data (419 PET and CT scans), as well as full-shot training using all 419 scans, within 5-fold cross-validation on the public HECKTOR 2022 challenge dataset. Testing was performed on 105 held-out HECKTOR cases as well as 29 institutional scans with varying levels of misalignment between the PET and CT scans.
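To make the fusion step concrete, the following is a minimal PyTorch sketch of a cross-attention block that mixes PET context into CT features, of the kind described above. The module name, token dimensions, and the assignment of CT features as queries with PET features as keys/values are illustrative assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse two modality token streams with multi-head cross attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ct = nn.LayerNorm(dim)
        self.norm_pet = nn.LayerNorm(dim)

    def forward(self, ct_tokens: torch.Tensor, pet_tokens: torch.Tensor) -> torch.Tensor:
        # ct_tokens, pet_tokens: (batch, num_tokens, dim) feature maps from
        # the two Swin encoder branches at the fusion stage.
        q = self.norm_ct(ct_tokens)         # queries from the CT stream
        kv = self.norm_pet(pet_tokens)      # keys/values from the PET stream
        fused, _ = self.attn(query=q, key=kv, value=kv)
        # Residual connection keeps the CT features intact while adding
        # attention-selected PET context.
        return ct_tokens + fused

# Example: fuse hypothetical stage-3 token maps of shape (batch, tokens, channels).
ct = torch.randn(2, 196, 384)
pet = torch.randn(2, 196, 384)
print(CrossAttentionFusion(dim=384)(ct, pet).shape)  # torch.Size([2, 196, 384])
```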
Results: XLinker achieved accuracies similar to full-shot training for both GTVp and GTVn in data-limited training regimes on the public dataset (Table 1). Significant differences were observed on the institutional dataset for GTVn but not for GTVp. XLinker segmented GTVp with higher accuracy than GTVn on the public dataset, whereas the opposite trend was observed on the institutional dataset. XLinker was somewhat resistant to spatial misalignments in the institutional dataset, especially for GTVn compared with GTVp, with GTVn having larger volumes than GTVp.
Conclusion: XLinker generated comparably accurate segmentations even in data-limited training settings, indicating that it can be trained in a data-efficient manner when pretrained networks are used to combine multimodality images. Efficient use of labeled data to achieve reasonably accurate segmentations is an important feature for extending deep learning beyond organs to tumors.
Abstract 3728 - Table 1: XLinker accuracies in data-limited and full-shot training, reported as mean ± standard deviation of the DSC metric. Performance differences between the full and data-limited models were measured using paired, two-sided Wilcoxon signed-rank tests at the 95% confidence level: * (p < 0.05), ** (p < 0.01), *** (p < 0.001).

Training data (%) | GTVp (public, N = 105) | GTVn (public, N = 105) | GTVp (institutional, N = 29) | GTVn (institutional, N = 29)
10 | 0.76 ± 0.14*** | 0.66 ± 0.175** | 0.59 ± 0.15 | 0.69 ± 0.09***
20 | 0.78 ± 0.13 | 0.69 ± 0.16 | 0.58 ± 0.16 | 0.71 ± 0.12*
50 | 0.78 ± 0.12 | 0.71 ± 0.15 | 0.595 ± 0.14 | 0.72 ± 0.12**
100 (full) | 0.79 ± 0.12 | 0.70 ± 0.15 | 0.62 ± 0.12 | 0.76 ± 0.10
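For reference, below is a brief sketch of how one of the paired, two-sided Wilcoxon signed-rank comparisons reported in Table 1 could be computed with SciPy. The per-case DSC arrays are synthetic placeholders drawn to roughly match the reported summary statistics; the abstract does not provide per-case values.

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic placeholder DSC values for the 105 public test cases;
# drawn to roughly match the reported mean ± SD, not real study data.
rng = np.random.default_rng(0)
dsc_full = rng.normal(0.79, 0.12, size=105).clip(0.0, 1.0)   # full-shot model
dsc_10pct = rng.normal(0.76, 0.14, size=105).clip(0.0, 1.0)  # 10% training data

# Paired, two-sided Wilcoxon signed-rank test on per-case DSC differences,
# matching the comparison described in the table caption.
stat, p = wilcoxon(dsc_full, dsc_10pct, alternative="two-sided")
print(f"W = {stat:.1f}, p = {p:.4f}")
```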