Main Session
Sep 30
PQA 09 - Hematologic Malignancies, Health Services Research, Digital Health Innovation and Informatics

3607 - Development and Validation of a Fact-Checking Algorithm to Enhance the Explainability of a Retrieval-Augmented Generation-Based Large Language Model for Radio-Oncological Clinical Questions

04:00pm - 05:00pm PT
Hall F
Screen: 4
POSTER

Presenter(s)

Annika Domres, TUM Universitätsklinikum, Munich, Bavaria

A. Domres1, J. Vladika1, M. Nguyen1, R. Moser1, D. Bernhardt2, S. E. Combs2, K. Borm3, F. Matthes1, and J. C. Peeken4; 1Technische Universität München, Munich, Germany, 2German Consortium for Translational Cancer Research (DKTK), Partner Site Munich, Munich, Germany, 3Department of Radiation Oncology, TUM University Hospital, Munich, Germany, 4Department of Radiation Oncology, School of Medicine and Health and Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany

Purpose/Objective(s): Large Language Models (LLMs) represent a novel source for accessing medical information; however, they are prone to errors in complex clinical queries. Retrieval-Augmented Generation (RAG) grounds generated responses in predefined reference material. This study aims to develop and validate a fact-checking algorithm that enhances source traceability at the factual level, detects LLM-generated misinformation, and corrects it.

Materials/Methods: Within the RAG system, the employed LLM, GPT-4o, received prompts (instructions), clinical questions, and reference material (medical guidelines). It extracted relevant data to generate a response, which was then decomposed into sub-facts (atomic facts) for fact checking. GPT-4o assessed each sub-fact for accuracy and assigned a verdict (TRUE/FALSE). If an atomic fact was deemed FALSE, it was corrected and the overall answer re-generated. A validation set of 40 question & answer (QA) pairs on prostate cancer treatment was tested under varying instructions to determine an optimal strategy. The verdicts, as the outcome of the fact-checking process, were compared to human assessments, and key indicators were collected (see Table 1). In-context learning was explored using example-based prompting (1-shot vs. 4-shot) and an iterative loop to reprocess facts labeled FALSE.
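A minimal sketch of such a generate-decompose-verify-correct loop is given below, assuming the OpenAI Python client; the prompt texts, retrieval setup, few-shot examples, and decomposition logic are placeholders, not the authors' actual implementation.

```python
# Hedged sketch of a RAG + atomic fact-checking loop (not the study's code).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"

def ask(prompt: str) -> str:
    """Single GPT-4o call; temperature 0 for reproducible verdicts (assumption)."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def generate_answer(question: str, guideline_excerpts: str) -> str:
    """RAG step: answer grounded in retrieved guideline text."""
    return ask(
        "Answer the clinical question using ONLY the reference material.\n"
        f"Reference material:\n{guideline_excerpts}\n\nQuestion: {question}"
    )

def decompose(answer: str) -> list[str]:
    """Split the answer into atomic facts (one claim per line)."""
    raw = ask("Decompose the following answer into atomic facts, one per line:\n" + answer)
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

def check_fact(fact: str, guideline_excerpts: str, shots: str = "") -> str:
    """Verdict step: label the fact TRUE or FALSE against the references.
    `shots` would carry the 1-shot / 4-shot in-context examples."""
    return ask(
        f"{shots}Based only on the reference material, is this fact TRUE or FALSE?\n"
        f"Reference material:\n{guideline_excerpts}\nFact: {fact}\nVerdict:"
    )

def correct_fact(fact: str, guideline_excerpts: str, shots: str = "") -> str:
    """Correction step for facts judged FALSE."""
    return ask(
        f"{shots}Correct this fact so it agrees with the reference material.\n"
        f"Reference material:\n{guideline_excerpts}\nFact: {fact}\nCorrected fact:"
    )

def fact_checked_answer(question: str, guideline_excerpts: str, max_loops: int = 2) -> str:
    answer = generate_answer(question, guideline_excerpts)
    for _ in range(max_loops):  # iterative loop: reprocess facts still labeled FALSE
        corrections = [
            correct_fact(fact, guideline_excerpts)
            for fact in decompose(answer)
            if "FALSE" in check_fact(fact, guideline_excerpts).upper()
        ]
        if not corrections:
            break
        answer = ask(
            "Rewrite the answer so it incorporates these corrected facts:\n"
            + "\n".join(corrections) + "\n\nOriginal answer:\n" + answer
        )
    return answer
```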

To differentiate the individual impact of fact checking from RAG alone, both were applied to the oncological clinical open-QA part of the "AMEGA" benchmark, which is based on four oncological cases with 30 questions and 231 evaluation criteria [1].

Results: The best results were achieved with 4-shot prompting for both verdict and correction, supplemented by the iterative loop. With an F1 score of 71%, 50% of hallucinations and 58% of inaccuracies were detected. This led to an improved overall answer in 42% of cases. The increase in AMEGA points scored with fact checking confirmed the improvement in answer quality (RAG only vs. RAG + fact checking): breast cancer 26 vs. 27.5; non-small cell lung cancer 27.5 vs. 30; prostate cancer 24.2 vs. 24.2; colon cancer 33.5 vs. 35.
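As a consistency check, the reported F1 scores follow from the precision (PPV) and sensitivity values in Table 1 via the standard formula F1 = 2PR / (P + R); a short sketch using the abstract's own numbers:

```python
# F1 from the reported precision (PPV) and sensitivity (recall) values in Table 1.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f"{100 * f1(0.71, 0.62):.1f}")  # 66.2 -> reported baseline F1 of 66%
print(f"{100 * f1(0.66, 0.78):.1f}")  # 71.5 -> reported 4-shot + loop F1 of 71%
```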

Conclusion: Atomic fact checking provided explainability for LLM-generated responses. The occurrence of hallucinations was reduced, and overall response quality was improved. This technique could support the safe clinical integration of such models in the future.

Abstract 3607 - Table 1

Metric                      Baseline [%]    4-shot + loop [%]
Sensitivity                 62              78
Specificity                 97              96
Precision (PPV)             71              66
F1 score                    66              71
Accuracy                    94              95
Overall improved answers    26              42
Overall worsened answers    20              2
Hallucination rate          2               1
Hallucination detection     25              50
Inaccuracy rate             3               3
Inaccuracy detection        50              58
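The classification metrics above compare the algorithm's per-fact verdicts against human assessments; a minimal sketch of how such figures could be derived is shown below. It assumes FALSE (an incorrect fact) is treated as the positive class, so sensitivity is the share of truly incorrect facts that the algorithm flags; the study's actual evaluation scripts and conventions are not given in the abstract.

```python
# Sketch: confusion-matrix metrics for fact-level verdicts vs. human labels.
# `predicted` / `human`: True means the fact was judged incorrect (FALSE verdict).
def verdict_metrics(predicted: list[bool], human: list[bool]) -> dict[str, float]:
    tp = sum(p and h for p, h in zip(predicted, human))
    tn = sum((not p) and (not h) for p, h in zip(predicted, human))
    fp = sum(p and (not h) for p, h in zip(predicted, human))
    fn = sum((not p) and h for p, h in zip(predicted, human))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "accuracy": (tp + tn) / len(human),
    }
```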