Keywords: ChatGPT; GPT; GPT-4; LLM; NLP; abstract screening; classification; extract; extraction; free text; language model; large language models; natural language processing; nonopioid analgesia; review methodology; review methods; screening; systematic; systematic review; unstructured data

MeSH: Humans; Biomedical Research; Consensus; Data Analysis; Problem Solving; Systematic Reviews as Topic; Natural Language Processing; Artificial Intelligence; Workflow

Source: DOI: 10.2196/48996 | PDF (PubMed)

Abstract:
BACKGROUND: The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subsequent health care decisions. Traditional methods rely heavily on human reviewers, often requiring a significant investment of time and resources.
OBJECTIVE: This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets, and to compare their performance against ground truth labeling by 2 independent human reviewers.
METHODS: We introduce a novel workflow using the ChatGPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to call the API with the screening criteria expressed in natural language, together with a corpus of title and abstract data sets already filtered by a minimum of 2 human reviewers. We compared the performance of our model against human-reviewed papers across 6 review papers, screening more than 24,000 titles and abstracts.
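The authors' script is not included in the abstract, but the workflow it describes (natural-language screening criteria sent alongside each title and abstract, with the model returning an include/exclude decision) can be sketched roughly as below. This is a minimal illustration only: the openai client usage, model name, prompt wording, example criteria, and INCLUDE/EXCLUDE parsing are all assumptions, not the authors' code.

```python
# Minimal sketch of the screening workflow described in METHODS.
# Assumptions (not from the paper): the `openai` Python client, the model
# name, the prompt wording, and the one-word INCLUDE/EXCLUDE protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical screening criteria written in natural language.
criteria = (
    "Include randomized trials of nonopioid analgesia in adults; "
    "exclude case reports, animal studies, and protocols."
)

def screen(title: str, abstract: str, model: str = "gpt-4") -> bool:
    """Return True if the model judges the record relevant under `criteria`."""
    prompt = (
        f"Screening criteria:\n{criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # minimize sampling variability between runs
    )
    answer = resp.choices[0].message.content.strip().upper()
    return answer.startswith("INCLUDE")
```

In practice such a script would loop over the exported title/abstract corpus and log each decision for comparison against the human consensus labels.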
RESULTS: Our results show an accuracy of 0.91, a macro F1-score of 0.60, a sensitivity of 0.91 for excluded papers, and a sensitivity of 0.76 for included papers. The interrater agreement between the 2 independent human screeners was κ=0.46, and the prevalence- and bias-adjusted κ between our proposed method and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions when asked to explain their reasoning for incorrect classifications.
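For readers unfamiliar with the reported statistics: Cohen's κ and the prevalence- and bias-adjusted κ (PABAK) follow standard definitions over a 2×2 include/exclude confusion table. The sketch below shows those computations; the counts passed in are placeholders for illustration, not the study's data.

```python
# Agreement metrics for a binary include/exclude screen, from a 2x2 table:
#   a = both raters say include   b = rater 1 include, rater 2 exclude
#   c = rater 1 exclude, rater 2 include   d = both say exclude
def agreement(a: int, b: int, c: int, d: int) -> dict:
    n = a + b + c + d
    po = (a + d) / n                        # observed agreement (accuracy)
    p_inc = ((a + b) / n) * ((a + c) / n)   # chance agreement on "include"
    p_exc = ((c + d) / n) * ((b + d) / n)   # chance agreement on "exclude"
    pe = p_inc + p_exc
    kappa = (po - pe) / (1 - pe)            # Cohen's kappa
    pabak = 2 * po - 1                      # prevalence- and bias-adjusted kappa
    sens_included = a / (a + c)             # sensitivity for included papers
    sens_excluded = d / (d + b)             # sensitivity for excluded papers
    return {"accuracy": po, "kappa": kappa, "pabak": pabak,
            "sens_included": sens_included, "sens_excluded": sens_excluded}

print(agreement(a=300, b=60, c=90, d=2400))  # illustrative counts only
```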
CONCLUSIONS: Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.