3607 - Development and Validation of a Fact-Checking Algorithm to Enhance the Explainability of a Retrieval-Augmented Generation-Based Large Language Model for Radio-Oncological Clinical Questions
A. Domres1, J. Vladika1, M. Nguyen1, R. Moser1, D. Bernhardt2, S. E. Combs2, K. Borm3, F. Matthes1, and J. C. Peeken4; 1Technische Universität München, Munich, Germany, 2German Consortium for Translational Cancer Research (DKTK), Partner Site Munich, Munich, Germany, 3Department of Radiation Oncology, TUM University Hospital, Munich, Germany, 4Department of Radiation Oncology, School of Medicine and Health and Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany
Purpose/Objective(s): Large Language Models (LLMs) represent a novel source for accessing medical information; however, they are prone to errors in complex clinical queries. Retrieval-Augmented Generation (RAG) grounds generated responses in predefined reference material. This study aims to develop and validate a fact-checking algorithm that enhances source traceability at the factual level, detects LLM-generated misinformation, and corrects it.
Materials/Methods: Within the RAG system, the employed LLM, GPT-4o, received prompts (instructions), clinical questions, and reference material (medical guidelines). It extracted the relevant information to generate a response, which was then decomposed into sub-facts (atomic facts) for fact-checking. GPT-4o assessed each sub-fact for accuracy and assigned a verdict (TRUE/FALSE). If an atomic fact was deemed FALSE, it was corrected and the overall answer re-generated. A validation set of 40 question-and-answer (QA) pairs on prostate cancer treatment was tested under varying instructions to determine an optimal strategy. The verdicts, as the outcome of the fact-checking process, were compared with human assessments, and key indicators were collected (see Table 1). In-context learning was explored using example-based prompting (1-shot vs. 4-shot) and an iterative loop that reprocessed facts labeled FALSE.
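The verdict-correct-regenerate workflow described above can be summarized in a short sketch. The snippet below is a minimal illustration, assuming the OpenAI chat-completions API for GPT-4o; the prompt wording, helper names (ask, decompose, check_and_correct), and the loop limit are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of an atomic fact-checking loop for a RAG answer.
# Assumes the OpenAI chat-completions API for GPT-4o; prompts and helper
# names are illustrative, not the study's actual implementation.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Single GPT-4o call returning the raw text response."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def decompose(answer: str) -> list[str]:
    """Split a RAG answer into atomic facts (one per line, illustrative prompt)."""
    out = ask(f"Split the following answer into atomic facts, one per line:\n{answer}")
    return [line.strip("- ").strip() for line in out.splitlines() if line.strip()]

def check_and_correct(answer: str, guideline_context: str, max_loops: int = 2) -> str:
    """Verdict each atomic fact against the guideline excerpt; correct facts
    judged FALSE and regenerate the answer, looping over the result."""
    for _ in range(max_loops):
        corrections = []
        for fact in decompose(answer):
            verdict = ask(
                "Given this guideline excerpt:\n"
                f"{guideline_context}\n\n"
                f"Is the statement TRUE or FALSE? Answer with one word.\nStatement: {fact}"
            )
            if verdict.upper().startswith("FALSE"):
                corrections.append(
                    ask("Correct this statement using only the guideline excerpt:\n"
                        f"{guideline_context}\n\nStatement: {fact}")
                )
        if not corrections:
            return answer  # every atomic fact was judged TRUE
        answer = ask(
            "Rewrite the answer so it incorporates these corrected facts:\n"
            + "\n".join(corrections) + f"\n\nOriginal answer:\n{answer}"
        )
    return answer
```

In the study, the verdict and correction prompts additionally contained one or four worked examples (1-shot vs. 4-shot in-context learning), which is omitted here for brevity.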
To differentiate the individual impact of fact-checking from that of RAG alone, both were applied to the oncological open-ended clinical QA part of the "AMEGA" benchmark, which comprises four oncological cases with 30 questions and 231 evaluation criteria [1].
Results: The best results were achieved with 4-shot prompting for both verdict and correction, supplemented by an iterative loop. With an F1 score of 71%, 50% of hallucinations and 58% of inaccuracies were detected, leading to an improved overall answer in 42% of cases. The increase in AMEGA points scored when fact-checking was applied confirmed the improvement in answer quality (RAG only vs. RAG + fact-checking): breast cancer, 26 vs. 27.5; non-small cell lung cancer, 27.5 vs. 30; prostate cancer, 24.2 vs. 24.2; colon cancer, 33.5 vs. 35.
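For reference, the verdict-level indicators reported in Table 1 follow standard confusion-matrix definitions when the algorithm's TRUE/FALSE verdicts are compared against human assessments. The sketch below is illustrative only; the function name and variable names are assumptions, and no counts from the study are included.

```python
# Illustrative computation of the verdict-level indicators in Table 1 from a
# confusion matrix of algorithm verdicts vs. human assessments (names assumed).
def verdict_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """tp/fn: erroneous facts (per human assessment) that the algorithm did /
    did not flag as FALSE; tn/fp: correct facts judged TRUE / flagged FALSE."""
    sensitivity = tp / (tp + fn)              # recall of erroneous facts
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)                # positive predictive value (PPV)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1, "accuracy": accuracy}
```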
Conclusion: Atomic fact-checking enhanced the explainability of LLM-generated responses, reduced the occurrence of hallucinations, and improved overall response quality. This technique could support the safe clinical integration of such models in the future.
Abstract 3607 - Table 1
Metric | Baseline [%] | 4-shot + loop [%]
Sensitivity | 62 | 78
Specificity | 97 | 96
Precision (PPV) | 71 | 66
F1 score | 66 | 71
Accuracy | 94 | 95
Overall improved answers | 26 | 42
Overall worsened answers | 20 | 2
Hallucination rate | 2 | 1
Hallucination detection | 25 | 50
Inaccuracy rate | 3 | 3
Inaccuracy detection | 50 | 58