Keywords: BERT; heterogeneous graph neural networks; medical text; relation extraction

MeSH: Electric Power Supplies; Information Storage and Retrieval; Neural Networks, Computer; Semantics; Pharmacopoeias as Topic

Source: DOI:10.3934/mbe.2024064

Abstract:
Effective information extraction from pharmaceutical texts is of great significance for clinical research. Ancient traditional Chinese medicine texts use terse sentences with complex semantic relationships, and these relationships may hold between heterogeneous entities. Current mainstream relation extraction models do not consider the associations between entities and relations during extraction, so the captured semantic information is insufficient to form an effective structured representation. In this paper, we propose a heterogeneous graph neural network relation extraction model adapted to traditional Chinese medicine (TCM) texts. First, the given sentence and the predefined relations are embedded with fine-tuned bidirectional encoder representations from transformers (BERT) word embeddings to form the model input. Second, a heterogeneous graph network is constructed to associate word, phrase, and relation nodes and obtain the hidden-layer representation. Then, in the decoding stage, a two-stage subject-object entity identification method is adopted: a binary classifier locates the start and end positions of TCM entities, identifying all subject and object entities in the sentence and finally forming the TCM entity-relation groups. Experiments on the TCM relation extraction dataset show that the heterogeneous graph neural network embedded with BERT achieves a precision of 86.99% and an F1 value of 87.40%, improvements of 8.83% and 10.21% over the relation extraction models CNN, Bert-CNN, and Graph LSTM.
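To make the decoding step concrete, below is a minimal sketch of the two-stage start/end tagging idea described in the abstract: per-token binary classifiers predict whether each token starts or ends an entity span, and predicted starts are paired with the nearest following end. This is an illustrative PyTorch-style assumption, not the authors' implementation; the class names, the hidden size of 768, and the random tensor standing in for the BERT plus graph-encoded representations are all hypothetical.

```python
import torch
import torch.nn as nn


class SpanTagger(nn.Module):
    """Binary start/end position tagger over token representations (sketch)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # One sigmoid classifier for span starts, one for span ends.
        self.start_clf = nn.Linear(hidden_size, 1)
        self.end_clf = nn.Linear(hidden_size, 1)

    def forward(self, token_states: torch.Tensor):
        # token_states: (batch, seq_len, hidden_size) hidden representations,
        # assumed to come from BERT refined by a heterogeneous graph layer.
        start_prob = torch.sigmoid(self.start_clf(token_states)).squeeze(-1)
        end_prob = torch.sigmoid(self.end_clf(token_states)).squeeze(-1)
        return start_prob, end_prob


def decode_spans(start_prob, end_prob, threshold=0.5):
    """Pair each predicted start with the nearest following predicted end."""
    starts = (start_prob >= threshold).nonzero(as_tuple=True)[0].tolist()
    ends = (end_prob >= threshold).nonzero(as_tuple=True)[0].tolist()
    spans = []
    for s in starts:
        later_ends = [e for e in ends if e >= s]
        if later_ends:
            spans.append((s, later_ends[0]))
    return spans


if __name__ == "__main__":
    # Random stand-in for encoded hidden states (hidden size 768 assumed).
    hidden = torch.randn(1, 16, 768)
    tagger = SpanTagger(hidden_size=768)
    start_p, end_p = tagger(hidden)
    print(decode_spans(start_p[0], end_p[0]))
```

In the full model, this tagging would be applied once to identify subject entities and again, conditioned on each subject and relation, to identify the corresponding object entities, yielding the TCM entity-relation groups.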