Keywords: Data imbalance; Drug blinding; Drug-drug interactions; Label noise

Source: DOI:10.1016/j.jbi.2022.104192

Abstract:
Extracting drug-drug interactions (DDIs) is an important task in biomedical research because it can reduce unexpected health risks during patient treatment. Previous work indicates that methods using external drug information perform much better than methods that do not. However, using external drug information is time-consuming and resource-intensive. In this work, we propose a novel DDI extraction method that does not use external drug information but still achieves comparable performance. First, we no longer convert drug names to standard tokens such as DRUG0, as was common in previous research. Instead, full drug names with drug entity markers are input to BioBERT, allowing us to highlight the selected drug entity pair. Second, we adopt the Key Semantic Sentence approach to emphasize the words closely related to the DDI relation of the selected drug pair. Together, these two steps significantly reduce the misclassification of similar instances that are created from the same sentence but correspond to different pairs of drug entities. Then, we employ the Gradient Harmonizing Mechanism (GHM) loss to reduce the weight of mislabeled instances and easy-to-classify instances, both of which can lead to poor performance in DDI extraction. Overall, we demonstrate that it is better not to use drug blinding with BioBERT, and show that GHM loss performs better than cross-entropy loss when the proportion of label noise is less than 30%. The proposed model achieves state-of-the-art results with an F1-score of 84.13% on the DDIExtraction 2013 corpus (a standard English DDI corpus), closing the 4% performance gap between methods that rely on external drug information and those that do not.
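The abstract does not spell out the GHM formulation, so the following is only a minimal sketch of the general idea of gradient-density reweighting, not the authors' implementation. It assumes that the gradient norm of an example can be approximated by `1 - p_true` (one minus the predicted probability of the labeled class) and that density is estimated with equal-width bins. The names `ghm_weights`, `ghm_cross_entropy`, and `num_bins` are illustrative.

```python
import math


def ghm_weights(grad_norms, num_bins=10):
    """Return one weight per example, inversely proportional to the number
    of examples that fall into the same gradient-norm bin.

    Assumption: every gradient norm lies in [0, 1].
    """
    n = len(grad_norms)
    # Map each gradient norm to an equal-width bin index in [0, num_bins - 1].
    bin_ids = [min(int(g * num_bins), num_bins - 1) for g in grad_norms]
    counts = [0] * num_bins
    for b in bin_ids:
        counts[b] += 1
    non_empty = sum(1 for c in counts if c > 0)
    # n / count is the inverse-density weight; dividing by the number of
    # non-empty bins keeps the mean weight at 1.
    return [n / (counts[b] * non_empty) for b in bin_ids]


def ghm_cross_entropy(probs, labels, num_bins=10):
    """Density-weighted cross-entropy.

    probs:  list of per-class probability lists (each row sums to 1)
    labels: list of integer class indices
    """
    p_true = [p[y] for p, y in zip(probs, labels)]
    # Assumed gradient-norm proxy: small for easy examples, close to 1 for
    # examples the model strongly disagrees with (possibly mislabeled).
    grad_norms = [1.0 - p for p in p_true]
    weights = ghm_weights(grad_norms, num_bins)
    losses = [-math.log(max(p, 1e-12)) for p in p_true]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

In this sketch, a batch dominated by very easy examples (gradient norm near 0) and by a cluster of very hard, possibly mislabeled examples (gradient norm near 1) gives both groups a weight smaller than that of the sparsely populated bins in between, which mirrors the down-weighting described above. When every example falls into its own bin, all weights are 1 and the result equals the ordinary mean cross-entropy.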