Keywords: chromatin; computational genomics; embeddings; functional genomics; genomic intervals; information retrieval; metadata; representation learning; search

Source: DOI:10.3390/bioengineering11030263

Abstract:
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
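The three retrieval tasks named in the abstract can all be phrased as nearest-neighbor lookups in one shared embedding space. The sketch below illustrates that idea only; it is not the authors' implementation. The vectors, the region set and label names, the `rank` helper, and the choice of cosine similarity as the distance measure are all assumptions made for illustration.

```python
import math

# Hypothetical, hand-made 3-d vectors standing in for learned co-embeddings.
# In a real system these would come from a trained model, not be typed in.
REGION_SETS = {
    "rs_liver_h3k27ac": [0.9, 0.1, 0.0],
    "rs_liver_atac": [0.8, 0.2, 0.1],
    "rs_tcell_h3k27ac": [0.1, 0.9, 0.2],
}
LABELS = {
    "liver": [1.0, 0.0, 0.0],
    "t-cell": [0.0, 1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_vec, candidates, k=2, exclude=None):
    """Return the names of the k candidates most similar to query_vec."""
    scored = [
        (name, cosine(query_vec, vec))
        for name, vec in candidates.items()
        if name != exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]


# Task 1: query string -> region sets (look up the label vector, rank region sets).
text_hits = rank(LABELS["liver"], REGION_SETS, k=2)

# Task 2: region set -> suggested labels (rank label vectors against the region set).
suggested = rank(REGION_SETS["rs_tcell_h3k27ac"], LABELS, k=1)

# Task 3: region set -> similar region sets (exclude the query itself).
similar = rank(
    REGION_SETS["rs_liver_atac"], REGION_SETS, k=1, exclude="rs_liver_atac"
)

print(text_hits, suggested, similar)
```

Because labels and region sets live in the same space, the same ranking routine serves all three tasks; only the choice of query vector and candidate pool changes.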