Keywords: biodiversity; information extraction; relation extraction; rule-based methods; transformer models; unsupervised methods

Source: DOI:10.3389/frai.2024.1371411

Abstract:
Fine-grained, descriptive information on the habitats and reproductive conditions of plant species is crucial in forest restoration and rehabilitation efforts. Precise timing of fruit collection and knowledge of species' habitat preferences and reproductive status are necessary, especially for tropical plant species that have short-lived recalcitrant seeds and those that exhibit complex reproductive patterns, e.g., species with supra-annual mass flowering events that may occur at irregular intervals. Understanding plant regeneration for the purpose of planning effective reforestation can be aided by providing access to structured information, e.g., in knowledge bases, that spans years if not decades and covers a wide range of geographic locations. The content of such a resource can be enriched with literature-derived information on species' time-sensitive reproductive conditions and location-specific habitats.
We sought to develop unsupervised approaches to extract relationships pertaining to habitats and their locations, and to reproductive conditions of plant species and corresponding temporal information. First, we handcrafted rules for a traditional rule-based pattern matching approach. We then developed a relation extraction approach building upon transformer models, i.e., the Text-to-Text Transfer Transformer (T5), casting the relation extraction problem as a question answering and natural language inference task. Finally, we propose a novel unsupervised hybrid approach that combines our rule-based and transformer-based approaches.
Evaluation of our hybrid approach on an annotated corpus of biodiversity-focused documents demonstrated an improvement of up to 15 percentage points in recall and the best performance over solely rule-based and transformer-based methods, with F1-scores ranging from 89.61% to 96.75% for reproductive condition - temporal expression relations and from 85.39% to 89.90% for habitat - geographic location relations. Our work shows that, even without training models on any domain-specific labeled dataset, we are able to extract relationships between biodiversity concepts from the literature with satisfactory performance.
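
To make the hybrid idea in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single hypothetical handcrafted rule (a regular expression linking a flowering or fruiting term to a month expression) and a hypothetical `toy_model` function that stands in for a transformer-based question-answering extractor such as one built on T5. The abstract does not say how the two approaches are combined; taking the union of their outputs is an assumption made here for illustration.

```python
import re
from typing import Callable, List, Tuple

Relation = Tuple[str, str]  # (reproductive condition, temporal expression)

MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")

# Hypothetical handcrafted rule: a condition keyword, then "in/during/from",
# then a month or a month range, all within one sentence.
RULE = re.compile(
    rf"\b(?P<cond>flower(?:s|ing)?|fruit(?:s|ing)?)\b[^.]*?"
    rf"\b(?:in|during|from)\s+(?P<time>{MONTHS}(?:\s+to\s+{MONTHS})?)",
    re.IGNORECASE,
)


def rule_based(sentence: str) -> List[Relation]:
    """Return every (condition, time) pair matched by the handcrafted rule."""
    return [(m.group("cond").lower(), m.group("time"))
            for m in RULE.finditer(sentence)]


def toy_model(sentence: str) -> List[Relation]:
    """Stand-in for a model-based extractor (assumption, not a real model).

    A real system would pose a question such as "When does the plant fruit?"
    to a question-answering model; here the answer is faked with a loose pattern.
    """
    m = re.search(rf"\b(fruits?|flowers?)\b.*?\b({MONTHS})\b",
                  sentence, re.IGNORECASE)
    return [(m.group(1).lower(), m.group(2))] if m else []


def hybrid(sentence: str,
           model_extractor: Callable[[str], List[Relation]]) -> List[Relation]:
    """Union of rule-based and model-based relations, in first-seen order."""
    merged: List[Relation] = []
    for relation in rule_based(sentence) + model_extractor(sentence):
        if relation not in merged:
            merged.append(relation)
    return merged


if __name__ == "__main__":
    print(hybrid("Flowering was observed in March to May.", toy_model))
    print(hybrid("The species fruits annually, peaking around August.", toy_model))
```

In this sketch the second sentence contains no "in/during/from" cue, so the rule finds nothing and the relation comes only from the stand-in model. This illustrates one way that combining the two extractors could raise recall.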