3735 - AI-Assisted Contour Quality Assurance for HN005 Clinical Trial
Presenter(s)
D. Wang1, S. S. Yom2, H. Geng1, S. H. Lee3, C. Henson4, J. A. Dorth5, J. W. Chan6, C. E. Lominska7, J. L. Harper8, M. F. Gensheimer9, W. A. Stokes10, J. R. Robbins11, S. H. Mashru12, A. Raben13, R. J. Kimple14, P. Bhateja15, Q. T. Le16, and Y. Xiao3; 1Department of Radiation Oncology, University of Pennsylvania, Philadelphia, PA, 2University of California San Francisco, San Francisco, CA, 3Department of Radiation Oncology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 4University of Oklahoma Health Sciences Center, Oklahoma City, OK, 5Department of Radiation Oncology, University Hospitals Seidman Cancer Center, Case Western Reserve University, Cleveland, OH, 6Department of Radiation Oncology, University of California San Francisco, San Francisco, CA, 7Department of Radiation Oncology, The University of Kansas Medical Center, Kansas City, KS, 8Department of Radiation Oncology, Medical University of South Carolina, Charleston, SC, 9Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA, 10Winship Cancer Institute, Emory University School of Medicine, Atlanta, GA, 11University of Arizona, College of Medicine-Tucson, Department of Radiation Oncology, Tucson, AZ, 12Accrual-Kaiser Permanente NCI Community Oncology Research Program, Portland, OR, 13Radiation Oncologists, PA; Christiana Care, Newark, DE, 14University of Wisconsin Carbone Cancer Center, Madison, WI, 15Department of Medical Oncology, The Ohio State University Wexner Medical Center, Columbus, OH, 16Stanford University, Stanford, CA
Purpose/Objective(s): Quality assurance (QA) of patient contours is critical for clinical trial success. Manual review of head-and-neck (HN) cancer contours is labor-intensive and subject to interobserver variability. This study employs AI-based auto-segmentation tools to improve efficiency and consistency in contour QA for clinical trial HN005.
Materials/Methods: Radiotherapy data from 187 patients submitted to HN005 were analyzed. All submitted contours, including organs-at-risk (OARs) and targets, were initially reviewed and scored by physicians according to protocol guidelines: score 1 (per protocol), score 2 (variation acceptable), and score 3 (deviation unacceptable). Two commercial auto-segmentation platforms, Carina INTContour and Therapanacea, were used to generate critical OARs, including brainstem, cochlea, bone mandible, pharynx, spinal cord, submandibular glands, larynx, thyroid, esophagus_s (upper cervical esophagus), and brachial plexus. A custom gross tumor volume (GTV) model for primary and nodal tumors was trained using Carina on 120 score 1 and score 2 cases and tested on 51 cases. The AI-generated contours were compared with the submitted contours for all 187 patients using the Dice similarity coefficient (DSC), mean surface distance (MSD), and 95% Hausdorff distance (HD95). Thresholds were established by averaging each metric over all score 1 contours, with a 95% confidence interval for statistical reliability. The thresholds were then used to review the submitted contours and generate AI-based QA scores. For each AI tool, the sensitivity in identifying unacceptable cases and the false positive rate (FPR) were reported, using the physician review score as ground truth.
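The metric-and-threshold step described above can be illustrated with a minimal sketch. The abstract does not spell out how the 95% confidence interval is turned into a threshold, so this example makes assumptions: masks are represented as sets of voxel indices, the interval uses a normal approximation, and a contour is flagged when its DSC falls below the lower bound of the interval around the score 1 mean. The function names `dice`, `ci95_of_mean`, and `flag_by_dsc` are hypothetical and do not come from either commercial platform.

```python
import math
from statistics import mean, stdev


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two masks given as sets of voxel indices."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks are treated as a perfect match
    return 2 * len(a & b) / (len(a) + len(b))


def ci95_of_mean(values):
    """Normal-approximation 95% confidence interval for the mean (needs >= 2 values)."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    return m - half_width, m + half_width


def flag_by_dsc(dsc, score1_dscs):
    """Assumed rule: flag a contour whose DSC is below the lower CI bound
    of the mean DSC observed for physician score 1 contours."""
    lower, _ = ci95_of_mean(score1_dscs)
    return dsc < lower
```

For distance metrics such as MSD and HD95, where larger values indicate worse agreement, the analogous assumed rule would flag values above the upper bound instead.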
Results: The proportion of cases with unacceptable contours ranged from 2/78 (BrachialPlexus) to 16/70 (Larynx_SG). Both AI tools demonstrated high sensitivity (>80%) in identifying unacceptable contours for the spinal cord, brainstem, cochlea_L/R, bone mandible, esophagus_s, submandibular glands_L/R, thyroid, larynx_SG, and brachial plexus_L/R. The FPRs for acceptable contours ranged from 5.5% to 49.1% across both tools. However, performance was notably lower for the larynx_SG, with FPR exceeding 72.2% for both tools. The pharynx showed lower sensitivity, at 53.8% and 61.5% for the two tools respectively, with a consistent FPR of 17.5%. For GTV, AI successfully identified 76% of score 3 cases, accompanied by an FPR of 46%.
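The sensitivity and FPR values reported above follow the standard definitions, with physician score 3 treated as the positive (unacceptable) class. A minimal sketch of that calculation is given below; the function name `sensitivity_and_fpr` and its input layout are assumptions made for illustration, not the study's code.

```python
def sensitivity_and_fpr(physician_scores, ai_flags):
    """Return (sensitivity, FPR) with physician score 3 as the positive class.

    physician_scores: iterable of 1, 2, or 3 (ground truth from physician review)
    ai_flags: iterable of bool, True when the AI-based QA flags the contour
    """
    tp = fn = fp = tn = 0
    for score, flagged in zip(physician_scores, ai_flags):
        unacceptable = score == 3
        if unacceptable and flagged:
            tp += 1  # unacceptable contour correctly flagged
        elif unacceptable:
            fn += 1  # unacceptable contour missed
        elif flagged:
            fp += 1  # acceptable contour (score 1 or 2) wrongly flagged
        else:
            tn += 1  # acceptable contour correctly passed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return sensitivity, fpr
```

Under these definitions, a high FPR means that many score 1 or score 2 contours would still be sent for manual review.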
Conclusion: AI-based auto-segmentation tools demonstrated potential in reducing manual QA burdens and enhancing consistency in clinical trials for HN cancer. While these tools effectively detect unacceptable contours, their QA performance is limited by the quality of the submitted contours and the suboptimal GTV delineation based solely on CT imaging. Addressing these limitations requires further efforts to acquire high-quality data to establish a robust gold standard and to develop customized workflows that more precisely identify specific errors in unacceptable contours.