3746 - Can Generative AI Agents Match the Experts? Measuring Large Language Model Performance in Evaluating NSCLC Treatment Plans
Presenter(s)

J. Xia1, J. Tam2, S. Chen2, A. Yu2, R. Samstein3, K. Rosenzweig4, M. Chao3, T. Liu4, and J. Zhang3; 1Icahn School of Medicine at Mount Sinai, New York, NY, 2Icahn School of Medicine at Mount Sinai, New York, NY, 3Icahn School of Medicine at Mount Sinai, Department of Radiation Oncology, New York, NY, 4Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, NY
Purpose/Objective(s): To investigate the feasibility of using generative AI agents to evaluate non-small cell lung cancer (NSCLC) treatment plans. By comparing AI agent-generated recommendations with experts' decisions, we aimed to assess the potential of large language models (LLMs) for clinical decision support.
Materials/Methods: We developed and evaluated two AI plan evaluation strategies for NSCLC treatment plans: (1) a multi-agent-based evaluator utilizing a consensus mechanism and (2) a single-LLM-based evaluator. In the multi-agent plan evaluator, four AI agents collaboratively assessed treatment plans. Three agents independently evaluated plan quality by analyzing dose-volume histogram (DVH) metrics, each using a different LLM (Llama 3.3, DeepSeek R1, or GPT-4o) together with prior clinical knowledge and clinical guidelines; these three agents then reached a final consensus by majority vote, which the fourth agent reviewed and verified. In the single-LLM approach, a single AI agent performed the entire evaluation using one of these LLMs.
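A minimal sketch of this consensus workflow, assuming a binary acceptable/unacceptable verdict per plan: the query_llm() helper, model identifiers, prompt wording, and verifier backbone are illustrative assumptions, not the authors' implementation or any specific LLM API.

```python
from collections import Counter

# Assumed model identifiers; replace with whatever client/model names you use.
EVALUATOR_MODELS = ["llama-3.3", "deepseek-r1", "gpt-4o"]
VERIFIER_MODEL = "gpt-4o"  # choice of verifier backbone is an assumption


def query_llm(model: str, prompt: str) -> str:
    """Hypothetical LLM wrapper; swap in a real client call here."""
    return "ACCEPTABLE - placeholder reply for demonstration only"


def agent_verdict(model: str, dvh_metrics: dict, guidelines: str) -> str:
    """One evaluator agent: judge plan quality from DVH metrics and guidelines."""
    prompt = (
        "You are a radiation oncology plan evaluator.\n"
        f"Clinical guidelines:\n{guidelines}\n"
        f"Dose-volume histogram metrics:\n{dvh_metrics}\n"
        "Reply with one word first: ACCEPTABLE or UNACCEPTABLE, then your rationale."
    )
    reply = query_llm(model, prompt).strip().upper()
    return "UNACCEPTABLE" if reply.startswith("UNACCEPTABLE") else "ACCEPTABLE"


def multi_agent_evaluate(dvh_metrics: dict, guidelines: str) -> str:
    """Three agents vote independently; a fourth agent reviews the majority verdict."""
    votes = [agent_verdict(m, dvh_metrics, guidelines) for m in EVALUATOR_MODELS]
    consensus, _ = Counter(votes).most_common(1)[0]  # majority vote
    review = query_llm(
        VERIFIER_MODEL,
        f"Individual verdicts: {votes}. Proposed consensus: {consensus}. "
        f"DVH metrics: {dvh_metrics}. Confirm or correct the consensus; "
        "reply ACCEPTABLE or UNACCEPTABLE with your reasoning.",
    ).strip().upper()
    return "UNACCEPTABLE" if review.startswith("UNACCEPTABLE") else consensus
```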
A total of 35 lung plan pairs (70 plans) were used for evaluation. Ground truth was defined by majority vote among three human experts. We compared the performance of both AI evaluators and the individual human experts against this ground truth using accuracy, F1 score, and Cohen's kappa, the last measuring inter-evaluator agreement. Both the AI systems and the human experts documented the rationale for their decisions.
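A minimal sketch of the scoring step, assuming binary per-plan verdicts (1 = acceptable, 0 = unacceptable); the example arrays are illustrative rather than study data, and scikit-learn's accuracy_score, f1_score, and cohen_kappa_score supply the metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score


def majority_vote(*label_arrays):
    """Ground truth: per-plan majority vote across the human experts."""
    stacked = np.stack(label_arrays)                       # (n_experts, n_plans)
    return (stacked.sum(axis=0) > stacked.shape[0] / 2).astype(int)


# Illustrative verdicts for a handful of plans (not the study data).
expert1 = np.array([1, 0, 1, 1, 0, 1])
expert2 = np.array([1, 0, 1, 0, 0, 1])
expert3 = np.array([1, 1, 1, 1, 0, 1])
ai_eval = np.array([1, 0, 1, 1, 0, 1])

ground_truth = majority_vote(expert1, expert2, expert3)

print("Accuracy:     ", accuracy_score(ground_truth, ai_eval))
print("F1 score:     ", f1_score(ground_truth, ai_eval))
print("Cohen's kappa:", cohen_kappa_score(ground_truth, ai_eval))
```

The same comparison against the majority-vote ground truth applies to each individual expert, which is how per-expert accuracy, F1, and kappa can be tabulated alongside the AI evaluators.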
Results: Across the 70 treatment plans evaluated, the multi-agent-based evaluator demonstrated strong agreement with the human experts, achieving an F1 score of 0.97, an accuracy of 0.94, and a Cohen's kappa of 0.77, identical to Experts #2 and #3. By contrast, Expert #1's performance was moderately lower, with an F1 score of 0.93, an accuracy of 0.88, and a Cohen's kappa of 0.60. The single-LLM-based evaluators performed below the multi-agent-based evaluator, with accuracies ranging from 0.88 to 0.91, F1 scores from 0.93 to 0.95, and Cohen's kappa values between 0.53 and 0.72. These results suggest that the AI plan evaluator with the consensus mechanism closely aligns with the majority expert clinical assessment and achieves high accuracy and consistency in evaluating NSCLC treatment plans.
Conclusion: This study demonstrates that a multi-agent-based AI plan evaluator driven by a consensus mechanism can achieve high accuracy and strong agreement with expert assessments when evaluating NSCLC treatment plans. By leveraging advanced AI reasoning capabilities, this approach could be integrated into clinical decision support and serve as an important component of automated treatment planning workflows.
Abstract 3746 - Table 1

Evaluator | Accuracy | F1 Score | Cohen's Kappa
Multi-agent-based Plan Evaluator | 0.94 | 0.97 | 0.77
Expert #1 | 0.88 | 0.93 | 0.60
Expert #2 | 0.94 | 0.97 | 0.77
Expert #3 | 0.94 | 0.97 | 0.77
Single-LLM-based Plan Evaluator (GPT-4o) | 0.88 | 0.93 | 0.53
Single-LLM-based Plan Evaluator (Llama 3.3) | 0.91 | 0.95 | 0.72
Single-LLM-based Plan Evaluator (DeepSeek R1) | 0.91 | 0.95 | 0.68