Main Session

Sep 30

PQA 09 - Hematologic Malignancies, Health Services Research, Digital Health Innovation and Informatics

3581 - Distance Metrics Predict Out-of-Distribution Performance of AI Oncology Models for Clinical Deployment

04:00pm - 05:00pm PT

Hall F

Screen: 2

POSTER

Presenter(s)

Syed Rakin Ahmed, MD, PhD - Harvard University, Massachusetts Institute of Technology and Dartmouth College, Hanover, NH

S. R. Ahmed¹,², C. Lu², and J. Kalpathy-Cramer³; ¹Harvard University, Cambridge, MA, ²Massachusetts Institute of Technology, Cambridge, MA, ³Massachusetts General Hospital, Boston, MA, United States

Purpose/Objective(s):

Although AI and deep learning pipelines for cancer diagnosis have proliferated, clinical translation remains sparse. A key concern is poor model performance upon deployment, caused by differences between real-world clinical data and the training data.

Materials/Methods:

To address this concern, we designed and validated novel distance metrics and correlated them with retraining performance under dataset distribution shifts. We used the DMIST dataset (108,230 mammograms, 21,729 patients), a multi-device dataset for multi-class breast density classification to facilitate breast cancer detection. The data were split 65% train : 10% validation : 25% test. We evaluated classification metrics (average AUROC, accuracy, and linear kappa) on the test set for deep neural network (DNN) training runs encompassing distribution shifts across four mammography scanners.

We identified two axes of distance metric computation for a classification model: 1. the n-dimensional feature (embeddings) vector space, F ⊆ ℝⁿ; and 2. the softmax prediction vector space, Δ^|y|, where |y| is the number of classes and Δ is the probability simplex. For each, we calculated three distance metrics. In the feature vector space: cosine distance, transport distance, and KL divergence. In the prediction vector space: average thresholded confidence with maximum confidence as the score function (ATC-MC), average confidence, and difference of confidences. Crucially, our approach can be applied to any underlying model architecture, since it relies solely on image-level model embeddings and prediction vectors for metric computation.
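The prediction-space and feature-space metrics can be illustrated with a minimal Python sketch. It assumes ATC-MC follows the usual average-thresholded-confidence recipe: choose a threshold on labeled source data so that the fraction of source samples whose maximum softmax confidence falls below it matches the source error rate, then report the fraction of unlabeled target samples at or above that threshold as the predicted accuracy. The function names, and the choice to compute cosine distance between the mean embeddings of two datasets, are illustrative assumptions and are not specified in this abstract.

```python
import math


def max_confidence(probs):
    """Maximum softmax probability for each prediction vector."""
    return [max(p) for p in probs]


def average_confidence(probs):
    """Mean of the maximum softmax probabilities over a dataset."""
    scores = max_confidence(probs)
    return sum(scores) / len(scores)


def atc_threshold(source_probs, source_labels):
    """Threshold t such that the share of source scores below t
    approximately equals the source error rate (ties make it approximate)."""
    scores = sorted(max_confidence(source_probs))
    preds = [p.index(max(p)) for p in source_probs]
    n_wrong = sum(1 for yhat, y in zip(preds, source_labels) if yhat != y)
    if n_wrong == len(scores):
        return math.inf  # every source sample is misclassified
    # Exactly n_wrong sorted scores lie strictly before index n_wrong.
    return scores[n_wrong]


def atc_predicted_accuracy(target_probs, threshold):
    """Predicted accuracy: share of target scores at or above the threshold."""
    scores = max_confidence(target_probs)
    return sum(1 for s in scores if s >= threshold) / len(scores)


def centroid(features):
    """Mean embedding of a dataset (assumed summary, for illustration only)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

Because these functions take only embeddings and softmax vectors as input, they are independent of the network that produced them, which is the architecture-agnostic property noted above.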

Results:

From the feature vector space, transport distance correlated best with classification performance for training runs on each source device (r' = -0.977, r_med = -0.975), where r' is the Fisher z-transformed mean correlation coefficient and r_med is the median correlation coefficient. Intuitively, a higher transport distance signifies a distribution farther from the source and therefore poorer classification performance. From the prediction vector space, ATC-MC correlated best with classification performance (r' = 0.962, r_med = 0.900), indicating close agreement between predicted and true classification performance. These correlations remained consistent across different model architectures and on repeat runs.
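For readers unfamiliar with the summary statistics, the following Python sketch shows one standard way to compute a Fisher z-transformed mean correlation: each per-run Pearson r is mapped to z = atanh(r), the z values are averaged, and the mean is mapped back with tanh. The function names are hypothetical, and the abstract does not state whether any sample-size weighting was applied, so an unweighted mean is assumed here.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def fisher_mean_r(rs):
    """Unweighted Fisher z-transformed mean of correlation coefficients.

    Each r must lie strictly inside (-1, 1), since atanh diverges at +/-1.
    """
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


def median_r(rs):
    """Median correlation coefficient across runs."""
    return statistics.median(rs)
```

Averaging in z-space rather than averaging r directly reduces the bias that arises because the sampling distribution of r is skewed when correlations are close to -1 or 1, as they are here.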

Conclusion:

Prior to clinical deployment, our approach can be used to generate calibration plots linking distance metrics with oncology model performance metrics across the different dataset distributions in a well-characterized dataset. Subsequently, these calibration plots can be used to make predictions on the expected classification performance on newly acquired datasets of unknown distributions, aligning with the FDA’s December 2024 guidance on AI deployment.
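As a rough illustration of how such a calibration plot could be turned into a prediction, the sketch below fits an ordinary least-squares line through (distance, performance) pairs from a well-characterized dataset and evaluates it at the distance measured on a new dataset. The linear form and the function names are assumptions made for illustration; the abstract does not specify the functional form of the calibration.

```python
def fit_calibration_line(distances, performances):
    """Ordinary least-squares fit: performance ~ slope * distance + intercept."""
    n = len(distances)
    mean_d = sum(distances) / n
    mean_p = sum(performances) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    if sxx == 0:
        raise ValueError("need at least two distinct distance values")
    sxy = sum((d - mean_d) * (p - mean_p)
              for d, p in zip(distances, performances))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_d
    return slope, intercept


def predict_performance(slope, intercept, new_distance):
    """Expected performance for a new dataset at the given distance."""
    return slope * new_distance + intercept
```

In use, the distance between the source data and a newly acquired dataset would be computed first and then passed to `predict_performance`; predictions at distances far outside the fitted range should be treated with caution.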