3761 - Leveraging a Large Language Model to Extract Structured Data Elements from Unstructured Radiology and Pathology Reports
Presenter(s)
S. Zhao1, D. Davila-Garcia2, K. Kiser2,3, and A. Wilcox2; 1Department of Radiation Oncology, WashU Medicine, St. Louis, MO, 2Institute for Informatics, Data Science & Biostatistics, WashU Medicine, St. Louis, MO, 3Department of Radiation Oncology, Washington University School of Medicine in St. Louis, St. Louis, MO
Purpose/Objective(s): Radiology reports and pathology narratives contain valuable information that is not encoded in the electronic health record in a structured format. Large language models (LLMs), such as OpenAI's GPT-4o family, offer a promising approach to automating clinical data extraction. This study evaluated the performance of GPT-4o mini in extracting six prognostic variables from unstructured radiology and pathology reports of radiotherapy patients.
Materials/Methods: From a dataset of 38,262 oncology patients evaluated in a radiation oncology department, we identified patient cohorts for the assessment of leptomeningeal disease (LMD), ascites, pleural effusion, breast cancer receptor status (ER, PR, HER2), Gleason score (primary and secondary), and surgical margin status. For each variable, 100–200 unique radiology or pathology reports were randomly selected. For variables with low prevalence (i.e., LMD and ascites), the random samples were enriched with additional reports matching the regular expressions “leptomening” and “ascit,” respectively. All reports were processed with a HIPAA-compliant GPT-4o mini deployment using zero-shot prompting. The model outputs were compared against gold-standard consensus annotations from a radiation oncologist, a medical student, and a biomedical informaticist. Performance was summarized using F1 scores, with means and standard deviations calculated from 1,000 bootstrapped samples (Table 1).
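The abstract does not publish its prompts or pipeline code; the following is a minimal sketch of the two methodological steps described above (keyword enrichment of low-prevalence samples and zero-shot extraction), assuming the standard OpenAI Python SDK. The prompt wording, JSON schema, sample sizes, and function names are illustrative, and the study itself ran against a HIPAA-compliant deployment rather than the public endpoint.

```python
import json
import random
import re

from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

client = OpenAI()  # illustrative: the study used a HIPAA-compliant GPT-4o mini deployment

# Hypothetical prompt; the abstract does not publish its prompt text.
SYSTEM_PROMPT = (
    "You extract structured data from a radiology report. "
    'Respond with JSON only: {"leptomeningeal_disease": "present" or "absent"}.'
)


def enrich_sample(reports, pattern, n_random=100, n_extra=50, seed=0):
    """Random sample of reports, topped up with reports matching a keyword
    regex (e.g. "leptomening" or "ascit") to enrich low-prevalence variables."""
    rng = random.Random(seed)
    base = rng.sample(reports, min(n_random, len(reports)))
    matches = [r for r in reports if re.search(pattern, r, re.IGNORECASE) and r not in base]
    return base + rng.sample(matches, min(n_extra, len(matches)))


def extract(report_text):
    """Zero-shot extraction of one variable from one report."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},  # constrain output to valid JSON
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```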
Results: A total of 908 unique radiology and pathology reports were analyzed. For the low-prevalence variables, extraction performance on the initial random samples was moderate for LMD (F1 0.47) and strong for ascites (F1 0.92). When those samples were enriched with keyword-matched reports, extraction was strong for both (F1 scores 0.79 and 1.00, respectively). The LLM demonstrated excellent performance in identifying variables that were more common (pleural effusion, F1 1.00) or reliably reported (primary and secondary Gleason scores, F1 scores 0.95 and 0.96; breast cancer receptors, ER 0.95, PR 0.90, HER2 0.90; positive surgical margins, 0.96).
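The F1 mean ± SD values above and in Table 1 summarize 1,000 bootstrap resamples of each report set (Methods). A minimal sketch of that computation, assuming per-report binary gold-standard and model labels (variable and function names are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score  # pip install scikit-learn


def bootstrap_f1(y_true, y_pred, n_boot=1000, seed=0):
    """Mean and standard deviation of the F1 score over bootstrap
    resamples (with replacement) of the report set."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = [
        f1_score(y_true[idx], y_pred[idx], zero_division=0)
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ]
    return float(np.mean(scores)), float(np.std(scores))


# e.g., one variable: 1 = positive finding, 0 = negative/absent
# mean_f1, sd_f1 = bootstrap_f1(gold_labels, model_labels)  # reported as F1 ± SD
```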
Conclusion: GPT-4o mini effectively extracted complex clinical variables from unstructured radiology and pathology reports, producing structured outputs with high F1 scores. Notably, for low-prevalence variables, performance improved markedly after a straightforward keyword-based enrichment step. These findings highlight the potential of LLMs for automating clinical data extraction in radiation oncology.
Abstract 3761 - Table 1
| Prognostic Variable | N | % Positive | F1 Score (mean ± SD) |
|---|---|---|---|
| *Radiology* | | | |
| LMD (random) | 100 | 1% | 0.47 ± 0.12 |
| LMD (enriched) | 100 | 47% | 0.79 ± 0.04 |
| Ascites (random) | 100 | 7% | 0.92 ± 0.07 |
| Ascites (enriched) | 100 | 20% | 1.00 ± 0.00 |
| Pleural effusion | 100 | 23% | 1.00 ± 0.00 |
| *Pathology* | | | |
| Gleason, primary | 149 | - | 0.95 ± 0.03 |
| Gleason, secondary | 149 | - | 0.96 ± 0.02 |
| ER | 100 | 76% | 0.95 ± 0.04 |
| PR | 100 | 44% | 0.90 ± 0.03 |
| HER2 | 100 | 5% | 0.90 ± 0.10 |
| Margins | 159 | 33% | 0.96 ± 0.02 |