METHODS: We used a mixed-methods design. The manual literature selection process of two independent clinicians was evaluated across 14 searches. This was followed by a series of simulations comparing the performance of random reading with screening prioritization based on active learning. We identified hard-to-find papers and checked their labels in a reflective dialogue.
Inter-rater reliability was assessed using Cohen's kappa (κ). To evaluate the performance of active learning, we used the Work Saved over Sampling at 95% recall (WSS@95) and the percentage of Relevant Records Found after reading only 10% of the total number of records (RRF@10). We used the average time to discovery (ATD) to detect records with potentially noisy labels. Finally, the accuracy of labeling was discussed in a reflective dialogue with guideline developers.
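The evaluation metrics named above are all simple functions of the raters' labels and the screening order. As a minimal sketch (not the authors' implementation; the toy inputs and function names are illustrative), Cohen's kappa, WSS at a chosen recall level, and RRF at a chosen reading fraction can be computed as:

```python
import math


def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters with binary labels (1 = include)."""
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal inclusion rate.
    p1, p2 = sum(rater1) / n, sum(rater2) / n
    p_expected = p1 * p2 + (1 - p1) * (1 - p2)
    return (p_observed - p_expected) / (1 - p_expected)


def wss_at(ranking, labels, recall=0.95):
    """Work Saved over Sampling: screening effort saved, relative to
    random reading, when stopping once `recall` of the relevant
    records has been found. `ranking` is the screening order
    (indices into `labels`)."""
    n, n_relevant = len(ranking), sum(labels)
    target = math.ceil(recall * n_relevant)
    found = 0
    for n_read, idx in enumerate(ranking, start=1):
        found += labels[idx]
        if found >= target:
            return (n - n_read) / n - (1 - recall)
    return 0.0


def rrf_at(ranking, labels, fraction=0.10):
    """Percentage of relevant records found after reading the top
    `fraction` of the ranking."""
    n_read = math.ceil(fraction * len(ranking))
    found = sum(labels[i] for i in ranking[:n_read])
    return 100 * found / sum(labels)
```

For example, with 10 records of which the first two are relevant and a ranking that surfaces them first, `wss_at(list(range(10)), [1, 1, 0, 0, 0, 0, 0, 0, 0, 0])` returns 0.75: screening stops after 2 of 10 records, saving 80% of the reading minus the 5% recall conceded.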
RESULTS: Mean κ for manual title-abstract selection by clinicians was 0.50 and varied between −0.01 and 0.87, based on 5,021 abstracts. WSS@95 ranged from 50.15% (SD = 17.7) for the selection by clinicians, to 69.24% (SD = 11.5) for the selection by the research methodologist, up to 75.76% (SD = 12.2) for the final full-text inclusion. A similar pattern was seen for RRF@10, ranging from 48.31% (SD = 23.3) to 62.80% (SD = 21.20) and 65.58% (SD = 23.25). The performance of active learning deteriorates with noisier labels: compared with the final full-text selection, the selections made by clinicians and research methodologists lowered WSS@95 by 25.61 and 6.52 percentage points, respectively.
CONCLUSIONS: While active machine learning tools can accelerate literature screening within guideline development, they can only perform as well as the input provided by human raters. Noisy labels make for noisy machine learning.