BACKGROUND: Biological research is generating large volumes of data distributed across many sources. Inconsistent naming of proteins and their encoding genes poses a major challenge for protein data integration: proteins and their encoding genes usually have multiple related names and symbols, which are difficult to match exactly; gene and protein nomenclature is complex and varies from species to species; some less-studied species have no gene or protein nomenclature at all; and the annotation of the same protein/gene varies greatly between databases. In summary, a comprehensive set of protein/gene synonyms is necessary for relevant studies.
RESULTS: In this study, we propose an approach, based on protein ontology, for integrating the synonyms of proteins and their encoding genes. The integration workflow consists of three modules: data acquisition; entity and attribute alignment; and attribute integration and deduplication. The resulting integrated synonym set of proteins and their encoding genes contains over 128.59 million terms covering 560,275 proteins/genes and 13,781 species. Serving as the semantic basis, the comprehensive synonym set was used to develop a data platform that provides one-stop data retrieval regardless of the diversity of protein nomenclature and species.
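As a rough illustration of the alignment, integration and deduplication modules described above, the following minimal Python sketch merges synonym lists from two sources and supports name-independent lookup. It is not the authors' implementation: the record layout, the source names, the identifiers, the `ID_TO_CANONICAL` alignment table and the case-insensitive normalization rule are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical records from two sources:
# (source name, source-specific ID, species, list of names/symbols)
RECORDS = [
    ("source_a", "A:001", "Homo sapiens", ["TP53", "p53", "Tumor protein p53"]),
    ("source_b", "B:9", "Homo sapiens", ["tp53", "P53", "Cellular tumor antigen p53"]),
    ("source_b", "B:10", "Mus musculus", ["Trp53", "p53"]),
]

# Hypothetical entity alignment: source-specific IDs mapped to one canonical ID
ID_TO_CANONICAL = {"A:001": "ENTITY:1", "B:9": "ENTITY:1", "B:10": "ENTITY:2"}


def normalize(name):
    """Assumed matching rule: collapse whitespace and ignore case."""
    return " ".join(name.split()).casefold()


def integrate(records, id_map):
    """Merge synonyms per canonical entity, keeping the first spelling seen."""
    merged = {}
    for _source, source_id, species, names in records:
        canonical = id_map[source_id]
        entry = merged.setdefault(canonical, {"species": species, "synonyms": {}})
        for name in names:
            # Deduplicate on the normalized form; keep the original spelling.
            entry["synonyms"].setdefault(normalize(name), name.strip())
    return merged


def build_index(merged):
    """Map each normalized synonym to the set of entities that carry it."""
    index = defaultdict(set)
    for canonical, entry in merged.items():
        for key in entry["synonyms"]:
            index[key].add(canonical)
    return index


def lookup(index, query):
    """Return the canonical IDs whose synonym set contains the query."""
    return sorted(index.get(normalize(query), set()))
```

In this sketch a symbol shared across species, such as "p53", resolves to more than one entity, so the stored species label is what lets a retrieval layer disambiguate the result.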
CONCLUSIONS: The synonym set constructed here can serve as an important resource for biological named entity recognition, text mining and information retrieval without name ambiguity. In particular, synonyms associated with well-defined species categories can help in studying the evolutionary relationships between species at the molecular level. More importantly, the comprehensive synonym set is the semantic basis for our subsequent studies on protein-protein interaction (PPI) knowledge graphs.