Main Session, Sep 30
PQA 09 - Hematologic Malignancies, Health Services Research, Digital Health Innovation and Informatics
3591 - AI Language Model Performance in Retrieving Phase III Radiotherapy Trials across Multiple Cancers
Presenter(s)
Warren Floyd, MD, PhD, BS - MD Anderson Cancer Center, Houston, TX
G. T. Carevic1,2, W. Floyd2, T. Kleber3, Z. Belal4, and C. D. Fuller2; 1University of Texas-Houston School of Medicine, Houston, TX, 2Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, 3Emory University, Atlanta, GA, 4Department of Radiation Oncology, University of Pennsylvania, Philadelphia, PA
Purpose/Objective(s):
We compared three AI-based large language model (LLM) search tools—o1 Pro Deep Research (o1), Grok 3 Deep Search (Grok3), and Perplexity Pro Deep Search (PDS)—for retrieving phase III clinical trial evidence supporting radiotherapy (RT). We repeated this query for multiple common and uncommon cancer types and hypothesized that these state-of-the-art models would show similar accuracy and comprehensiveness.
Materials/Methods:
Each model was given an identical prompt to identify phase III trials involving the use of RT for five common (breast, prostate, NSCLC, rectal, HPV+ oropharyngeal SCC) and three rare (hypopharyngeal SCC, medulloblastoma, vulvar) malignancies, plus a negative control (colon cancer). Two reviewers independently scored each response on a 0–2 rubric (0 = misleading; 1 = generally accurate but lacking key details; 2 = correct and comprehensive), with a third reviewer adjudicating any discordant scores. We recorded the number of included studies, the accuracy of PubMed ID references or links, and the number of incorrect statements per 100 words for each response. Overall response scores were compared with Kruskal-Wallis tests followed by Dunn's multiple comparisons testing; other comparisons used Welch's ANOVA with Dunnett's T3 multiple comparisons testing.
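For readers who wish to reproduce this style of analysis, the comparison of ordinal response scores can be sketched in Python using scipy for the Kruskal-Wallis test and scikit-posthocs for Dunn's pairwise comparisons. The score values below and the Holm adjustment are illustrative assumptions, not the study data, and the original analysis may have been performed in dedicated statistics software.

```python
# Minimal sketch of the score comparison described above (illustrative data, not study results).
# Assumes pandas, scipy, and scikit-posthocs are installed.
import pandas as pd
from scipy.stats import kruskal
import scikit_posthocs as sp

# Hypothetical 0-2 rubric scores for the nine tumor-site prompts, one row per model response.
scores = pd.DataFrame({
    "model": ["o1"] * 9 + ["Grok3"] * 9 + ["PDS"] * 9,
    "score": [2, 2, 2, 2, 1, 2, 2, 2, 2,   # o1 (hypothetical)
              1, 1, 2, 1, 0, 1, 2, 1, 1,   # Grok3 (hypothetical)
              1, 1, 1, 0, 1, 1, 1, 2, 1],  # PDS (hypothetical)
})

# Kruskal-Wallis test across the three models.
groups = [g["score"].values for _, g in scores.groupby("model")]
h_stat, p_value = kruskal(*groups)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.3f}")

# Dunn's pairwise comparisons; the Holm correction is an assumption,
# since the abstract does not state which adjustment was applied.
dunn = sp.posthoc_dunn(scores, val_col="score", group_col="model", p_adjust="holm")
print(dunn)
```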
Results:
On average, o1 generated 9.8 verifiable studies with a mean of 2,639 words per response, while Grok3 generated an average of 2 studies with a mean of 1,558 words per response and PDS generated an average of 2.1 studies with a mean of 643 words per response. The three LLMs had significant differences in overall response scores, with o1 (median = 2) scoring significantly higher than PDS (median = 1, p < 0.01) but not Grok3 (median = 1, p = 0.07). There was also a significant difference (p = 0.03) in the number of factual errors generated per one thousand words of reply, with o1 (mean = 0.13) having significantly fewer errors than Grok3 (mean = 1.4, p = 0.05) but not PDS (mean = 0.58, p = 0.51). Additionally, there was a significant difference in the number of incorrect PubMed IDs provided by the models (p = 0.03), with o1 (mean = 0.13) providing significantly fewer incorrect references than PDS (mean = 3.8, p = 0.039) but not Grok3 (mean = 2.9, p = 0.1).
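As an illustration of how reference accuracy and error rates of this kind can be audited, the sketch below queries NCBI's E-utilities esummary endpoint to check whether a model-cited PubMed ID resolves to a real record and normalizes an error count to a per-1,000-word rate. The example PMIDs, error counts, and helper functions are hypothetical; the abstract does not describe the authors' exact verification workflow.

```python
# Illustrative check of model-cited PubMed IDs against NCBI E-utilities
# (not the authors' verification pipeline, which is not described in the abstract).
import requests

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def pmid_exists(pmid: str) -> bool:
    """Return True if the PMID resolves to a real PubMed record."""
    resp = requests.get(
        ESUMMARY,
        params={"db": "pubmed", "id": pmid, "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    record = resp.json().get("result", {}).get(pmid, {})
    return bool(record) and "error" not in record

def errors_per_thousand_words(n_errors: int, n_words: int) -> float:
    """Normalize a factual-error count to the per-1,000-word rate reported above."""
    return 1000 * n_errors / n_words

# Hypothetical usage: PMIDs extracted from one model response.
for pmid in ["32955177", "99999999"]:  # hypothetical example IDs
    print(pmid, "valid" if pmid_exists(pmid) else "not found")

print(errors_per_thousand_words(n_errors=2, n_words=1558))  # ~1.28 errors per 1,000 words
```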
Conclusion:
In this head-to-head comparison of LLM-based search tools, o1 Pro Deep Research consistently generated more comprehensive lists of phase III radiotherapy trials, with fewer factual errors per thousand words and fewer incorrect PubMed IDs than the other models. Overall, o1 was highly accurate, outperforming Perplexity Pro Deep Search and Grok3 on several key metrics. Nonetheless, all three models generated false claims and incomplete responses. These results highlight the variability in performance across advanced AI language models for literature retrieval and corroborate earlier reports of ChatGPT's limitations in radiation oncology. Continued refinement of AI-based tools is needed to enhance both accuracy and comprehensiveness and to ensure more consistent, reliable performance.