Keywords: Biomedical relation extraction; Large language models; Relation classification; SemMedDB; SemRep

MeSH: Natural Language Processing; Data Mining / methods; Semantics; MEDLINE; PubMed; Algorithms; Humans; Databases, Factual

Source: DOI:10.1016/j.jbi.2024.104658

Abstract:
OBJECTIVE: Relation extraction is an essential task in the field of biomedical literature mining and offers significant benefits for various downstream applications, including database curation, drug repurposing, and literature-based discovery. The broad-coverage natural language processing (NLP) tool SemRep has established a solid baseline for extracting subject-predicate-object triples from biomedical text and has served as the backbone of the Semantic MEDLINE Database (SemMedDB), a PubMed-scale repository of semantic triples. While SemRep achieves reasonable precision (0.69), its recall is relatively low (0.42). In this study, we aimed to enhance SemRep using a relation classification approach, in order to eventually increase the size and the utility of SemMedDB.
METHODS: We combined and extended existing SemRep evaluation datasets to generate training data. We leveraged the pre-trained PubMedBERT model, enhancing it through additional contrastive pre-training and fine-tuning. We experimented with three entity representations: mentions, semantic types, and semantic groups. We evaluated the model performance on a portion of the SemRep Gold Standard dataset and compared it to SemRep performance. We also assessed the effect of the model on a larger set of 12K randomly selected PubMed abstracts.
RESULTS: Our results show that the best model yields a precision of 0.62, recall of 0.81, and F1 score of 0.70. Assessment on 12K abstracts shows that the model could double the size of SemMedDB when applied to the entire PubMed. We also manually assessed the quality of 506 triples predicted by the model that SemRep had not previously identified, and found that 67% of these triples were correct.
CONCLUSIONS: These findings underscore the promise of our model in achieving a more comprehensive coverage of relationships mentioned in biomedical literature, thereby showing its potential in enhancing various downstream applications of biomedical literature mining. Data and code related to this study are available at https://github.com/Michelle-Mings/SemRep_RelationClassification.
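The precision, recall, and F1 figures quoted in the abstract can be cross-checked with the standard harmonic-mean formula. The sketch below is only an arithmetic illustration: the function name `f1_score` is our own, and the SemRep F1 value is derived here from the reported precision and recall rather than stated in the abstract. The abstract also does not say whether the SemRep figures were measured on the same test split as the model, so the two numbers may not be directly comparable.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as quoted in the abstract.
semrep_f1 = f1_score(0.69, 0.42)  # derived here; not reported in the abstract
model_f1 = f1_score(0.62, 0.81)   # abstract reports 0.70

print(f"SemRep F1 (derived): {semrep_f1:.2f}")
print(f"Best model F1:       {model_f1:.2f}")
```

The derived model value rounds to the reported 0.70, which suggests the three reported metrics are internally consistent.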