Keywords: Brain tumor; Large language model; Magnetic resonance imaging; Natural language processing

Source: DOI: 10.1007/s00234-024-03427-7

Abstract:
OBJECTIVE: This study aimed to investigate the efficacy of a fine-tuned large language model (LLM) in classifying brain MRI reports into pretreatment tumor, posttreatment tumor, and nontumor cases.
METHODS: This retrospective study included 759, 284, and 164 brain MRI reports for the training, validation, and test datasets, respectively. Radiologists stratified the reports into three groups: nontumor (group 1), posttreatment tumor (group 2), and pretreatment tumor (group 3) cases. A pretrained Bidirectional Encoder Representations from Transformers (BERT) Japanese model was fine-tuned on the training dataset and evaluated on the validation dataset. The model that achieved the highest accuracy on the validation dataset was selected as the final model. Two additional radiologists classified the reports in the test dataset into the three groups. The model's performance on the test dataset was compared with that of the two radiologists.
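
For readers unfamiliar with the fine-tuning workflow described above, the sketch below shows how a pretrained Japanese BERT model could be fine-tuned for three-class report classification with the Hugging Face transformers library. It is illustrative only and not the authors' code: the checkpoint name (cl-tohoku/bert-base-japanese), hyperparameters, and dataset format are assumptions not taken from the paper.

```python
# Minimal sketch: fine-tune a pretrained Japanese BERT for 3-class report classification.
# Checkpoint name, hyperparameters, and data below are assumptions for illustration only.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

MODEL_NAME = "cl-tohoku/bert-base-japanese"  # assumed checkpoint; the paper only specifies a "BERT Japanese" model

# Hypothetical data: report text plus group label
# (0 = nontumor, 1 = posttreatment tumor, 2 = pretreatment tumor).
train_ds = Dataset.from_dict({"text": ["..."], "label": [0]})
val_ds   = Dataset.from_dict({"text": ["..."], "label": [0]})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Truncate/pad each report to a fixed length for batching.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
val_ds   = val_ds.map(tokenize, batched=True)

# Classification head with 3 output labels on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

args = TrainingArguments(
    output_dir="bert-report-classifier",
    num_train_epochs=5,              # assumed
    per_device_train_batch_size=16,  # assumed
    learning_rate=2e-5,              # assumed
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
print(trainer.evaluate())  # validation metrics; the best-performing checkpoint would be kept as the final model
```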
RESULTS: The fine-tuned LLM attained an overall accuracy of 0.970 (95% CI: 0.930-0.990). The model's sensitivity for groups 1/2/3 was 1.000/0.864/0.978, and its specificity for groups 1/2/3 was 0.991/0.993/0.958. No statistically significant differences in accuracy, sensitivity, or specificity were found between the LLM and the human readers (p ≥ 0.371). The LLM completed the classification task approximately 20-26 times faster than the radiologists. The area under the receiver operating characteristic curve was 0.994 (95% CI: 0.982-1.000) for discriminating groups 2 and 3 from group 1, and 0.992 (95% CI: 0.982-1.000) for discriminating group 3 from groups 1 and 2.
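
As an illustration of how metrics of this kind are derived, the following sketch computes per-group sensitivity and specificity from a confusion matrix and one-vs-rest ROC AUCs with scikit-learn. The labels and probabilities are purely hypothetical; this is not the study's evaluation code.

```python
# Minimal sketch: per-group sensitivity/specificity and one-vs-rest ROC AUC.
# All inputs below are toy values, not data from the study.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical test-set outputs: true labels, predicted labels, class probabilities
# (0 = nontumor, 1 = posttreatment tumor, 2 = pretreatment tumor).
y_true = np.array([0, 1, 2, 2, 0, 1])
y_pred = np.array([0, 1, 2, 1, 0, 1])
y_prob = np.array([[0.90, 0.05, 0.05],
                   [0.10, 0.80, 0.10],
                   [0.05, 0.15, 0.80],
                   [0.10, 0.60, 0.30],
                   [0.80, 0.10, 0.10],
                   [0.20, 0.70, 0.10]])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
for k in range(3):
    # One-vs-rest counts for class k.
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp
    fp = cm[:, k].sum() - tp
    tn = cm.sum() - tp - fn - fp
    print(f"group {k}: sensitivity={tp / (tp + fn):.3f}, specificity={tn / (tn + fp):.3f}")

# AUC for discriminating tumor (groups 1 or 2) from nontumor (group 0),
# scored by the summed tumor-class probability.
auc_tumor = roc_auc_score((y_true > 0).astype(int), y_prob[:, 1] + y_prob[:, 2])
# AUC for discriminating pretreatment tumor (group 2) from groups 0 and 1.
auc_pretx = roc_auc_score((y_true == 2).astype(int), y_prob[:, 2])
print(auc_tumor, auc_pretx)
```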
CONCLUSIONS: The fine-tuned LLM demonstrated performance comparable to that of radiologists in classifying brain MRI reports while requiring substantially less time.