Main Session

Sep 30

PQA 09 - Hematologic Malignancies, Health Services Research, Digital Health Innovation and Informatics

3758 - Microsoft Copilot's Performance in Extracting Variables about Radiation Oncology Breast Cancer Articles

04:00pm - 05:00pm PT

Hall F

Screen: 14

POSTER

Presenter(s)

Qian Zhang, MD, PhD, MS - McGaw Medical Center of Northwestern University, Chicago, IL

Q. S. Zhang¹, A. Ho¹, S. Pachigolla¹, and E. D. Donnelly²; ¹Department of Radiation Oncology, McGaw Medical Center of Northwestern University, Chicago, IL, ²Department of Radiation Oncology, Northwestern Memorial Hospital, Chicago, IL

Purpose/Objective(s): We hypothesized that Microsoft Copilot could extract values for variables from Radiation Oncology breast cancer articles with good correctness and completeness and that quotes and confidence levels could facilitate human review of Copilot’s responses.

Materials/Methods: We obtained PDFs (33 full-length articles and 1 abstract) of randomly selected articles cited in high-yield RadOncQuestions questions. From Jan 28, 2025 to Feb 1, 2025, we prompted Copilot to extract values from each PDF in JSON format for variables including: Population Studied, Intervention Group, Comparison Group, Outcome, Study Design, Year of Publication, Number of Participants, Methods, Results, and Interpretation. We also prompted Copilot for its confidence (%) that each value was correct and for quotes supporting the value. Copilot extracted values for 26 articles; these values were read into R 4.4.1 to create tables of variables’ values for each article. Three blinded Radiation Oncology physicians rated each value’s completeness (% of the article’s information about the value that was captured by the value) and correctness (% of the value that was correct), and each quote’s correctness (% of the quote that was truly in the article) on 5-point Likert scales (1:~0%, 2:~25%, 3:~50%, 4:~75%, 5:~100%). We tested the hypothesis that mean ratings were the same across variables using one-way ANOVA. We linearly regressed the value’s correctness on its supporting quote’s correctness with mean model E(Value’s correctness | Quote’s correctness) = β₀ + β₁*Quote’s correctness. We converted 110 “High” and “Low” confidence responses to 90% and 25% respectively, combined these numbers with 176 confidence percents, and regressed the value’s correctness on its confidence (%*100) with mean model E(Correctness | Confidence) = β₀ + β₁*Confidence.
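The confidence conversion and simple linear regression described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code (the abstract states that the analysis was done in R 4.4.1). The function names, the toy data, and the assumption that confidence is expressed on a 0 to 100 scale are all hypothetical; only the "High" to 90% and "Low" to 25% mapping and the mean model E(Correctness | Confidence) = β₀ + β₁*Confidence come from the text.

```python
# Hypothetical sketch: function names and toy data are illustrative only.
# Only the High -> 90 and Low -> 25 mapping is taken from the abstract.
LABEL_TO_PERCENT = {"high": 90.0, "low": 25.0}


def confidence_to_percent(raw):
    """Return a confidence in percent from a label ("High"/"Low") or a number."""
    if isinstance(raw, (int, float)):
        return float(raw)
    text = str(raw).strip().lower()
    if text in LABEL_TO_PERCENT:
        return LABEL_TO_PERCENT[text]
    # Assumed format for numeric responses: "85" or "85%".
    return float(text.rstrip("%"))


def fit_simple_ols(x, y):
    """Least-squares estimates (b0, b1) for the mean model E(y | x) = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


if __name__ == "__main__":
    # Toy data, not from the study: raw confidence responses and 1-5 ratings.
    raw_confidences = ["High", "95%", 80, "Low", "90"]
    correctness = [5, 5, 4, 4, 5]
    confidences = [confidence_to_percent(r) for r in raw_confidences]
    b0, b1 = fit_simple_ols(confidences, correctness)
    print(f"intercept={b0:.3f}, slope={b1:.4f}")
```

The same fitting function would apply to the quote-correctness model by passing quote ratings as `x`. A p-value for the slope would need a t-test on the coefficient, which this sketch leaves out.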

Results: The mean value’s completeness was 4.23 (SD 1.04), mean value’s correctness was 4.75 (SD 0.72), and mean quote’s correctness was 4.83 (SD 0.55). Variables differed statistically significantly in mean completeness (from 3.62 for Population Studied to 5.00 for Year of Publication, p<0.001) and in mean correctness (from 4.15 for Comparison or Control Group to 5.00 for Year of Publication, p<0.001), but not in mean quote’s correctness. The supporting quote’s correctness was statistically significantly associated with the value’s correctness (β₁ estimate=0.23, p=0.003). All physicians reported that quotes facilitated their ratings. Copilot’s confidence was not associated with values’ correctness (β₁ estimate=-0.01, p=0.48).

Conclusion: Copilot showed good overall correctness and completeness (mean ≥ 4 out of 5) in extracting variables’ values from Radiation Oncology breast cancer articles, although both varied across variables. Prompting Copilot for supporting quotes facilitated human review of values, while prompting for the percent confidence in the value being correct did not. Copilot can assist Radiation Oncology education.