METHODS: We used a mixed-methods design. The manual literature selection process of two independent clinicians was evaluated across 14 searches. This was followed by a series of simulations comparing the performance of random reading with screening prioritization based on active learning. We identified hard-to-find papers and checked their labels in a reflective dialogue.
Inter-rater reliability was assessed using Cohen's kappa (κ). To evaluate the performance of active learning, we used the Work Saved over Sampling at 95% recall (WSS@95) and the percentage of Relevant Records Found after reading only 10% of the total number of records (RRF@10). We used the average time to discovery (ATD) to detect records with potentially noisy labels. Finally, the accuracy of labeling was discussed in a reflective dialogue with guideline developers.
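The evaluation metrics named above are all simple functions of the raters' labels and the screening order. As a minimal sketch (not the authors' implementation; the toy inputs and function names are illustrative), Cohen's kappa, WSS at a chosen recall level, and RRF at a chosen reading fraction can be computed as:

```python
import math


def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters with binary labels (1 = include)."""
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal inclusion rate.
    p1, p2 = sum(rater1) / n, sum(rater2) / n
    p_expected = p1 * p2 + (1 - p1) * (1 - p2)
    return (p_observed - p_expected) / (1 - p_expected)


def wss_at(ranking, labels, recall=0.95):
    """Work Saved over Sampling: screening effort saved, relative to
    random reading, when stopping once `recall` of the relevant
    records has been found. `ranking` is the screening order
    (indices into `labels`)."""
    n, n_relevant = len(ranking), sum(labels)
    target = math.ceil(recall * n_relevant)
    found = 0
    for n_read, idx in enumerate(ranking, start=1):
        found += labels[idx]
        if found >= target:
            return (n - n_read) / n - (1 - recall)
    return 0.0


def rrf_at(ranking, labels, fraction=0.10):
    """Percentage of relevant records found after reading the top
    `fraction` of the ranking."""
    n_read = math.ceil(fraction * len(ranking))
    found = sum(labels[i] for i in ranking[:n_read])
    return 100 * found / sum(labels)
```

For example, with 10 records of which the first two are relevant and a ranking that surfaces them first, `wss_at(list(range(10)), [1, 1, 0, 0, 0, 0, 0, 0, 0, 0])` returns 0.75: screening stops after 2 of 10 records, saving 80% of the reading minus the 5% recall conceded.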
RESULTS: Mean κ for manual title-abstract selection by clinicians was 0.50 and varied between −0.01 and 0.87, based on 5,021 abstracts. WSS@95 ranged from 50.15% (SD = 17.7) for the selection by clinicians, to 69.24% (SD = 11.5) for the selection by the research methodologist, up to 75.76% (SD = 12.2) for the final full-text inclusion. A similar pattern was seen for RRF@10, ranging from 48.31% (SD = 23.3) to 62.80% (SD = 21.20) and 65.58% (SD = 23.25). The performance of active learning deteriorates with noisier labels: compared with the final full-text selection, the selections made by clinicians and research methodologists lowered WSS@95 by 25.61 and 6.52 percentage points, respectively.
CONCLUSIONS: While active machine learning tools can accelerate literature screening within guideline development, they can only perform as well as the input provided by human raters. Noisy labels make for noisy machine learning.