3726 - When Big Data Plateaus: Investigating the Limits of Data and Model Scaling for Cervical Cancer Segmentation
Presenter(s)
H. Sun, W. Huang, H. Xiao, X. Deng, A. Qu, J. Wang, and P. Jiang; Department of Radiation Oncology, Peking University Third Hospital, Beijing, China
Purpose/Objective(s): Precise segmentation of cervical cancer remains challenging due to inherent image variability and often subtle tumor features. The prevailing "big data + big model" hypothesis suggests that continuously increasing training dataset size and model complexity leads to steady performance improvements. This study assesses whether such scaling of data and model size yields meaningful gains for cervical cancer segmentation, or whether alternative approaches are needed.
Materials/Methods: We tested this hypothesis on a dataset of 500 cervical cancer patients (50/20 training/test split; expert annotations with TiGRT MC TPS v2.0). Model scaling was assessed by evaluating five deep learning models, classified as "big" or "small" according to parameter count and pre-training data scale: (BM1) SAM (largest, general pre-training); (BM2) Swin-Unet (50K medical pre-training); (BM3) Swin-Unet (5K medical pre-training); (SM1) 3D Unet (1K medical pre-training); and (SM2) 3D Unet (smallest, no pre-training). Dataset scaling was then evaluated for the two best-performing big models (BM2 and BM3) by training them on increasing dataset sizes (50 to 500 patients). Segmentation performance was assessed using 3D Dice scores.
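For reference, the 3D Dice score used for evaluation follows the standard Dice similarity coefficient over binary segmentation volumes. A minimal NumPy sketch (function name and epsilon smoothing are illustrative, not taken from the study):

```python
import numpy as np

def dice_3d(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice similarity coefficient between two binary 3D volumes.

    Dice = 2 * |pred ∩ gt| / (|pred| + |gt|); eps avoids division by
    zero when both masks are empty.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps))
```

Identical masks score 1.0, disjoint masks score 0.0, and a mask overlapping half of an equally sized reference scores 0.5.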
Results: At the largest dataset size (500 patients), Dice scores were: BM1 0.60 ± 0.05, BM2 0.75 ± 0.03, BM3 0.80 ± 0.02, SM1 0.65 ± 0.04, SM2 0.70 ± 0.04. Regarding model scaling, surprisingly, BM3, the least complex of the big models, achieved the highest Dice, outperforming both BM2 and BM1. Similarly, among the smaller models, the less complex SM2 outperformed the more complex SM1. In cross-group comparison, the general trend of big models outperforming smaller ones held in some cases (e.g., BM2 > SM1), but exceptions were observed: SM2, a small model, unexpectedly outperformed BM1, a big model. Regarding dataset scaling, BM3 consistently yielded higher Dice than BM2 across all dataset sizes. However, BM3 reached peak performance (0.81 ± 0.02) at 200 patients, then plateaued and decreased slightly to 0.80 ± 0.02 at 500 patients; BM2 similarly plateaued around 0.75 ± 0.03 beyond 300 patients. Segmentation performance for each model is summarized in Table 1.
Conclusion: This study challenges the simplistic "big data + big model" hypothesis for cervical cancer segmentation. While big models can offer advantages, especially with targeted pre-training, no general outperformance of big models was observed. Instead, optimal performance was achieved by a moderately sized, medically pre-trained model, and simply scaling dataset size provided limited benefit. Alternative approaches, such as integrating multi-modal or clinical data and reducing data heterogeneity, warrant further investigation.
Abstract 3726 - Table 1
Model | Description | Dice |
BM1 | Largest, general pre-training | 0.60 ± 0.05 |
BM2 | Big, 50K medical pre-training | 0.75 ± 0.03 |
BM3 | Big, 5K medical pre-training | 0.80 ± 0.02 |
SM1 | Small, 1K medical pre-training | 0.65 ± 0.04 |
SM2 | Smallest, no pre-training | 0.70 ± 0.04 |