Keywords: GPT-3.5; GPT-4; artificial intelligence; information science; language model; library science; meta-analysis; prompt engineering; screening; systematic review

MeSH: Humans; Systematic Reviews as Topic; Language

Source: DOI:10.2196/52758

Abstract:
BACKGROUND: The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers.
OBJECTIVE: We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract screening process for systematic reviews. Our goal was to develop a screening method that maximizes sensitivity for identifying relevant records.
METHODS: We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across 3 layers: (1) research design, (2) target patients, and (3) interventions and controls. The 3-layer screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and optimization for screening were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and those meeting the inclusion criteria at all layers were subsequently judged as included.
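The layered decision rule described in METHODS can be sketched as follows. This is a minimal illustration, not the authors' implementation: `keyword_judge` is a hypothetical stand-in for the GPT-3.5/GPT-4 call, and its keywords, the function names, and the sample records are all assumptions made for the example.

```python
from typing import Callable, Dict

# The three screening layers named in the abstract.
LAYERS = ["research design", "target patients", "interventions and controls"]

# A judge receives (layer, record_text) and returns True when the record
# meets that layer's inclusion criteria. In the study this role is played
# by a language model; here it is an abstract callable.
Judge = Callable[[str, str], bool]


def screen_record(record: str, judge: Judge) -> Dict[str, bool]:
    """Evaluate one record on every layer and return per-layer verdicts."""
    return {layer: judge(layer, record) for layer in LAYERS}


def is_included(verdicts: Dict[str, bool]) -> bool:
    """A record is judged as included only if it passes all layers."""
    return all(verdicts.values())


# Hypothetical toy judge: a keyword match per layer (assumption, not the
# prompts used in the study).
TOY_KEYWORDS = {
    "research design": "randomized",
    "target patients": "bipolar",
    "interventions and controls": "placebo",
}


def keyword_judge(layer: str, record: str) -> bool:
    return TOY_KEYWORDS[layer] in record.lower()


# Hypothetical sample records for illustration only.
record_a = "A randomized placebo-controlled trial in bipolar disorder"
record_b = "A case report of bipolar disorder treated with lithium"

included_a = is_included(screen_record(record_a, keyword_judge))
included_b = is_included(screen_record(record_b, keyword_judge))
print(included_a, included_b)
```

Requiring a pass on every layer means a single wrong exclusion at any layer removes the record. This may be why the abstract emphasizes sensitivity.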
RESULTS: On each layer, both GPT-3.5 and GPT-4 were able to process about 110 records per minute, and the total time required for screening the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively. Both the GPT-3.5 and GPT-4 screenings judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.958/0.116 and 0.875/0.855, respectively. The sensitivities for the relevant records align with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second study. Both the GPT-3.5 and GPT-4 screenings judged all 9 records used for the meta-analysis as included. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second study. Further investigation indicated that the cases incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while the cases incorrectly excluded by GPT-4 were due to misinterpretations of the inclusion criteria.
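For reference, the sensitivity and specificity figures reported above follow the standard definitions, sketched below. The labels in the usage example are hypothetical and are not data from the study. The function names are likewise illustrative.

```python
from typing import List, Tuple


def confusion_counts(truth: List[bool], predicted: List[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), where True means 'relevant / included'."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = sum(t and p for t, p in zip(truth, predicted))
    fp = sum((not t) and p for t, p in zip(truth, predicted))
    tn = sum((not t) and (not p) for t, p in zip(truth, predicted))
    fn = sum(t and (not p) for t, p in zip(truth, predicted))
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    """Share of truly relevant records that the screener included."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of truly irrelevant records that the screener excluded."""
    return tn / (tn + fp)


# Hypothetical labels: 3 relevant and 5 irrelevant records.
truth = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]

tp, fp, tn, fn = confusion_counts(truth, predicted)
print(f"sensitivity={sensitivity(tp, fn):.3f} specificity={specificity(tn, fp):.3f}")
```

A screener tuned for sensitivity accepts more false positives. That is consistent with the low GPT-3.5 specificity reported for the second study alongside its high sensitivity.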
CONCLUSIONS: Our 3-layer screening method with GPT-4 demonstrated an acceptable level of sensitivity and specificity, which supports its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.