3607 - Development and Validation of a Fact-Checking Algorithm to Enhance the Explainability of a Retrieval-Augmented Generation-Based Large Language Model for Radio-Oncological Clinical Questions
A. Domres1, J. Vladika1, M. Nguyen1, R. Moser1, D. Bernhardt2, S. E. Combs2, K. Borm3, F. Matthes1, and J. C. Peeken4; 1Technische Universität München, Munich, Germany, 2German Consortium for Translational Cancer Research (DKTK), Partner Site Munich, Munich, Germany, 3Department of Radiation Oncology, TUM University Hospital, Munich, Germany, 4Department of Radiation Oncology, School of Medicine and Health and Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany
Purpose/Objective(s): Large Language Models (LLMs) represent a novel source for accessing medical information; however, they are prone to errors in complex clinical queries. Retrieval-Augmented Generation (RAG) grounds generated responses in predefined reference material. This study aims to develop and validate a fact-checking algorithm that enhances source traceability at the factual level, detects LLM-generated misinformation, and corrects it.
Materials/Methods: Within the RAG system, the employed LLM, GPT-4o, received prompts (instructions), clinical questions, and reference material (medical guidelines). It extracted the relevant information to generate a response, which was then decomposed into sub-facts (atomic facts) for fact-checking. GPT-4o assessed each sub-fact for accuracy and assigned a verdict (TRUE/FALSE). If an atomic fact was deemed FALSE, it was corrected and the overall answer re-generated. A validation set of 40 question-and-answer (QA) pairs on prostate cancer treatment was tested under varying instructions to determine an optimal strategy. The verdicts, as the outcome of the fact-checking process, were compared with human assessments, and key indicators were collected (see Table 1). In-context learning was explored using example-based prompting (1-shot vs. 4-shot) and an iterative loop that reprocessed facts labeled FALSE.
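The verdict-correct-regenerate workflow described above can be summarized in a short sketch. The snippet below is a minimal illustration, assuming the OpenAI chat-completions API for GPT-4o; the prompt wording, helper names (ask, decompose, check_and_correct), and the loop limit are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of an atomic fact-checking loop for a RAG answer.
# Assumes the OpenAI chat-completions API for GPT-4o; prompts and helper
# names are illustrative, not the study's actual implementation.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Single GPT-4o call returning the raw text response."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def decompose(answer: str) -> list[str]:
    """Split a RAG answer into atomic facts (one per line, illustrative prompt)."""
    out = ask(f"Split the following answer into atomic facts, one per line:\n{answer}")
    return [line.strip("- ").strip() for line in out.splitlines() if line.strip()]

def check_and_correct(answer: str, guideline_context: str, max_loops: int = 2) -> str:
    """Verdict each atomic fact against the guideline excerpt; correct facts
    judged FALSE and regenerate the answer, looping over the result."""
    for _ in range(max_loops):
        corrections = []
        for fact in decompose(answer):
            verdict = ask(
                "Given this guideline excerpt:\n"
                f"{guideline_context}\n\n"
                f"Is the statement TRUE or FALSE? Answer with one word.\nStatement: {fact}"
            )
            if verdict.upper().startswith("FALSE"):
                corrections.append(
                    ask("Correct this statement using only the guideline excerpt:\n"
                        f"{guideline_context}\n\nStatement: {fact}")
                )
        if not corrections:
            return answer  # every atomic fact was judged TRUE
        answer = ask(
            "Rewrite the answer so it incorporates these corrected facts:\n"
            + "\n".join(corrections) + f"\n\nOriginal answer:\n{answer}"
        )
    return answer
```

In the study, the verdict and correction prompts additionally contained one or four worked examples (1-shot vs. 4-shot in-context learning), which is omitted here for brevity.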
To differentiate the individual impact of fact-checking from that of RAG alone, both were applied to the oncological open-ended clinical QA part of the "AMEGA" benchmark, which comprises four oncological cases with 30 questions and 231 evaluation criteria [1].
Results: The best results were achieved with 4-shot prompting for both verdict and correction, supplemented by an iterative loop. With an F1 score of 71%, 50% of hallucinations and 58% of inaccuracies were detected, leading to an improved overall answer in 42% of cases. The increase in AMEGA points scored when fact-checking was applied confirmed the improvement in answer quality (RAG only vs. RAG + fact-checking): breast cancer, 26 vs. 27.5; non-small cell lung cancer, 27.5 vs. 30; prostate cancer, 24.2 vs. 24.2; colon cancer, 33.5 vs. 35.
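For reference, the verdict-level indicators reported in Table 1 follow standard confusion-matrix definitions when the algorithm's TRUE/FALSE verdicts are compared against human assessments. The sketch below is illustrative only; the function name and variable names are assumptions, and no counts from the study are included.

```python
# Illustrative computation of the verdict-level indicators in Table 1 from a
# confusion matrix of algorithm verdicts vs. human assessments (names assumed).
def verdict_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """tp/fn: erroneous facts (per human assessment) that the algorithm did /
    did not flag as FALSE; tn/fp: correct facts judged TRUE / flagged FALSE."""
    sensitivity = tp / (tp + fn)              # recall of erroneous facts
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)                # positive predictive value (PPV)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1, "accuracy": accuracy}
```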
Conclusion: Atomic fact-checking enhanced the explainability of LLM-generated responses, reduced the occurrence of hallucinations, and improved overall response quality. This technique could support the safe clinical integration of such models in the future.
Abstract 3607 - Table 1
Metric | Baseline [%] | 4-shot + loop [%]
Sensitivity | 62 | 78
Specificity | 97 | 96
Precision (PPV) | 71 | 66
F1 score | 66 | 71
Accuracy | 94 | 95
Overall improved answers | 26 | 42
Overall worsened answers | 20 | 2
Hallucination rate | 2 | 1
Hallucination detection | 25 | 50
Inaccuracy rate | 3 | 3
Inaccuracy detection | 50 | 58