Keywords: AI; LIME; health misinformation; interpretable artificial intelligence; local interpretable model-agnostic explanation; machine learning

Source: DOI: 10.2196/37751 (PubMed)

Abstract:
BACKGROUND: Machine learning techniques have been shown to be efficient in identifying health misinformation, but the results may not be trusted unless they can be justified in a way that is understandable.
OBJECTIVE: This study aimed to provide a new criteria-based system to assess and justify health news quality. Using a subset of an existing set of criteria, this study compared the feasibility of 2 alternative methods for adding interpretability. Both methods used classification and highlighting to visualize sentence-level evidence.
METHODS: A total of 3 out of 10 well-established criteria were chosen for experimentation, namely whether the health news discussed the costs of the intervention (the cost criterion), explained or quantified the harms of the intervention (the harm criterion), and identified conflicts of interest (the conflict criterion). The first step of the experiment was to automate the evaluation of the 3 criteria by developing a sentence-level classifier. We tested logistic regression, naive Bayes, support vector machine, and random forest algorithms. Next, we compared the 2 visualization approaches. For the first approach, we calculated word feature weights, which explained how the classification models distill keywords that contribute to the prediction; then, using the local interpretable model-agnostic explanation (LIME) framework, we selected keywords associated with the classified criterion at the document level; finally, the system selected and highlighted sentences containing those keywords. For the second approach, we extracted sentences that provided evidence supporting the evaluation result from 100 health news articles; based on these results, we trained a typology classification model at the sentence level; the system then highlighted positive sentence instances to justify the result. The number of sentences to highlight was set by a preset threshold chosen empirically from the average accuracy.
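As a rough illustration of the first visualization approach only, the sketch below is a minimal example, not the authors' implementation: it trains a toy TF-IDF plus logistic regression classifier for the cost criterion, uses the LIME text explainer to pick positively weighted keywords, and highlights sentences containing those keywords. The training sentences, labels, class names, and highlighting rule are all invented for the example.

```python
# Minimal sketch (assumed toy data, not the study's corpus or pipeline):
# sentence-level classifier for one criterion + LIME keyword highlighting.
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training sentences, labeled 1 if they satisfy the cost criterion.
sentences = [
    "The new drug costs about $12,000 per year.",
    "Researchers enrolled 200 patients in the trial.",
    "Insurance coverage for the procedure remains unclear.",
    "The study was published in a peer-reviewed journal.",
]
labels = [1, 0, 1, 0]

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(sentences, labels)

# Explain a document-level prediction and keep positively weighted keywords.
article = ("The treatment costs about $12,000 per year. "
           "Researchers enrolled 200 patients in the trial.")
explainer = LimeTextExplainer(class_names=["criterion not met", "criterion met"])
explanation = explainer.explain_instance(article, pipeline.predict_proba,
                                         num_features=5)
keywords = {word.lower() for word, weight in explanation.as_list() if weight > 0}

# Highlight sentences that contain any LIME-selected keyword.
for sentence in article.split(". "):
    tokens = {token.strip(".").lower() for token in sentence.split()}
    marker = ">>" if tokens & keywords else "  "
    print(marker, sentence)
```

The second approach would instead train a sentence-level classifier directly on evidence sentences and highlight its positive predictions up to the preset threshold; it is not shown here.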
RESULTS: The automatic evaluation of health news on the cost, harm, and conflict criteria achieved average area under the curve scores of 0.88, 0.76, and 0.73, respectively, after 50 repetitions of 10-fold cross-validation. We found that both approaches could successfully visualize the interpretation of the system, but the performance of the 2 approaches varied by criterion, and the highlighting accuracy decreased as the number of highlighted sentences increased. When the threshold accuracy was ≥75%, the resulting visualization had a variable length of 1 to 6 sentences.
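For context on how such scores are typically computed, the snippet below is a minimal sketch of an average AUC over 50 repetitions of 10-fold cross-validation in scikit-learn; the synthetic features and the logistic regression classifier are placeholders, not the study's data or models.

```python
# Minimal sketch (assumed synthetic data): mean AUC over 50 x 10-fold CV.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=50, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="roc_auc", cv=cv)
print(f"Mean AUC over {len(scores)} folds: {scores.mean():.2f}")
```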
CONCLUSIONS: We provided 2 approaches to interpret criteria-based health news evaluation models tested on 3 criteria. This method incorporated rule-based and statistical machine learning approaches. The results suggested that one might visually interpret an automatic criterion-based health news quality evaluation successfully using either approach; however, larger differences may arise when multiple quality-related criteria are considered. This study can increase public trust in computerized health information evaluation.