323 - Quantifying Treatment Plan Quality in Non-Small Cell Lung Cancer Radiotherapy Based on Pairwise Preferences
Presenter(s)

J. Zhang1, Y. Lei2, J. Tam2, A. Yu2, S. Chen2, R. Samstein3, K. Rosenzweig4, M. Chao1, T. Liu4, and J. Xia5; 1Icahn School of Medicine at Mount Sinai, Department of Radiation Oncology, New York, NY, 2Icahn School of Medicine at Mount Sinai, New York, NY, 3Memorial Sloan Kettering Cancer Center, New York, NY, 4Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, NY, 5Icahn School of Medicine at Mount Sinai, New York, NY
Purpose/Objective(s):
Objective treatment plan evaluation is critical for standardizing plan quality and improving clinical efficiency. Previous studies have demonstrated that structured scoring mechanisms can guide planners toward improved dosimetric outcomes. However, manually constructed scoring methods rely on predefined metrics and may not fully capture the complex tradeoffs inherent in treatment planning. To address this limitation, we aim to develop a data-driven scoring model that approximates expert plan evaluation using pairwise preference-based learning, leveraging a large language model (LLM) as a scalable surrogate for human plan evaluators.
Materials/Methods:
To develop a human-aligned scoring function, 35 lung plan pairs were created and evaluated by two experienced planners, each selecting the preferred plan in every pair. To expand the pairwise preference dataset while maintaining expert-level decision-making, an LLM agent was developed to generate human-level preference assessments from features derived from 10 clinically relevant dose-volume metrics. After validation against the human planners' feedback, the LLM agent was instructed to evaluate additional plans from a second dataset of 1,240 plan pairs (62 lung patients, 32 plans each, 20 randomly sampled pairs per patient). A scoring model was then trained with a binary cross-entropy-with-logits loss to learn preference-based scoring. Agreement between the LLM agent and the human planners was assessed using accuracy and F1 score, and inter-planner agreement was also evaluated to quantify inter-planner variation. To evaluate the improvement attributable to the LLM agent, a scoring function was separately trained and validated using only human feedback with five-fold cross-validation.
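The abstract does not specify the scoring model architecture, so the following is only a minimal sketch of preference-based score learning with a binary cross-entropy-with-logits loss: each plan's 10 dose-volume metrics are mapped to a scalar score, and the score difference within a pair is fit to the evaluator's preference (Bradley-Terry style). The PlanScorer class, hidden-layer size, optimizer, and learning rate are illustrative assumptions, not details taken from the study.

```python
# Minimal sketch (assumed implementation details): a plan is represented by its
# dose-volume metric features, the model assigns it a scalar quality score, and
# pairwise preferences are learned with BCE-with-logits on the score difference.
import torch
import torch.nn as nn

class PlanScorer(nn.Module):
    """Maps a plan's dose-volume metric features to a scalar quality score."""
    def __init__(self, n_features: int = 10, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one score per plan

def train_step(model, optimizer, plan_a, plan_b, preferred_a):
    """One update on a batch of plan pairs.

    plan_a, plan_b: (batch, n_features) dose-volume metric tensors
    preferred_a:    (batch,) float labels, 1.0 if plan A was preferred
    """
    criterion = nn.BCEWithLogitsLoss()
    logits = model(plan_a) - model(plan_b)  # score difference acts as the logit
    loss = criterion(logits, preferred_a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random stand-in data (35 pairs, 10 metrics per plan).
model = PlanScorer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
plan_a, plan_b = torch.randn(35, 10), torch.randn(35, 10)
preferred_a = torch.randint(0, 2, (35,)).float()
for _ in range(100):
    train_step(model, optimizer, plan_a, plan_b, preferred_a)
```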
Results:
The LLM-generated responses demonstrated strong agreement with the human evaluators, achieving an accuracy of 0.91 and an F1 score of 0.95. When trained solely on human evaluator input, the scoring model attained an accuracy of 0.73 and an F1 score of 0.76. Incorporating LLM-generated feedback significantly improved model performance, yielding an accuracy of 0.82 and an F1 score of 0.84, which closely approaches the inter-planner agreement observed between the human evaluators (accuracy: 0.83, F1 score: 0.89).
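For clarity, a minimal sketch of how pairwise agreement between two evaluators (human vs. human, or LLM vs. human) could be summarized with accuracy and F1 score is shown below; the preference labels are hypothetical stand-ins, not data from the study.

```python
# Each plan pair yields a binary preference label (e.g., 1 if plan A was preferred);
# agreement between two evaluators is summarized with accuracy and F1 score.
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical preference labels for the same plan pairs from two evaluators.
evaluator_1 = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
evaluator_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy:", accuracy_score(evaluator_1, evaluator_2))
print("F1 score:", f1_score(evaluator_1, evaluator_2))
```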
Conclusion:
This study introduces a novel framework for treatment plan evaluation, leveraging pairwise preference-based learning and LLM-generated feedback to approximate expert decision-making. Compared with manually constructed scoring mechanisms, our approach dynamically learns from expert feedback, enabling a more flexible and generalizable assessment of plan quality. This model offers a scalable solution for planner training, clinical plan assessment, and automated treatment planning.