Keywords: NLP; gap closing; gap filling; next-generation sequencing

MeSH: Neural Networks, Computer; Algorithms; Deep Learning; Genome, Fungal; Saccharomyces cerevisiae / genetics; Schizosaccharomyces / genetics; High-Throughput Nucleotide Sequencing / methods; Neurospora crassa / genetics; Software; Genomics / methods; Sequence Analysis, DNA / methods

Source: DOI: 10.3390/ijms25158502

Abstract:
With the widespread adoption of next-generation sequencing technologies, the speed and convenience of genome sequencing have improved significantly, and many genomes have been sequenced. However, the assembly of small genomes still faces a series of challenges, including repetitive fragments, inverted repeats, low sequencing coverage, and the limitations of sequencing technologies. These challenges leave unknown gaps in small genomes and hinder complete genome assembly. Although many assembly tools exist, they do not fully exploit the potential of artificial intelligence technologies, which limits further improvement in gap filling. Here, we propose DLGapCloser, a novel deep learning-based method that assists traditional tools in filling additional gaps in small genomes. First, we created four datasets based on the original genomes of Saccharomyces cerevisiae, Schizosaccharomyces pombe, Neurospora crassa, and Micromonas pusilla. To extract more effective information from the gene sequences, we also added homologous genomes to enrich the datasets. Second, we proposed the DGCNet model, which effectively extracts features and learns context from the sequences flanking gaps. To address early pruning and high memory usage in the Beam Search algorithm, we developed a new prediction algorithm, Wave-Beam Search, which alternates between expansion and contraction phases to enhance efficiency and accuracy. Experimental results showed that the Wave-Beam Search algorithm improved the gap-filling performance of assembly tools by 7.35%, 28.57%, 42.85%, and 8.33% over their original results. Finally, we established new gap-filling standards and created and implemented a novel evaluation method. Validation on the genomes of Saccharomyces cerevisiae, Schizosaccharomyces pombe, Neurospora crassa, and Micromonas pusilla showed that DLGapCloser increased the number of filled gaps by 8.05%, 15.3%, 1.4%, and 7%, respectively, compared to traditional assembly tools.
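The abstract describes Wave-Beam Search only as alternating between expansion and contraction phases, so the following is a minimal illustrative sketch of that general idea, not the authors' algorithm. It assumes a beam search over nucleotides whose width switches between a wide and a narrow setting every few steps. The names `toy_log_probs`, `alternating_beam_search`, `wide`, `narrow`, and `period` are hypothetical, and the toy scorer merely stands in for a trained model such as DGCNet.

```python
import math
from typing import Callable, Dict, List, Tuple

BASES = "ACGT"


def toy_log_probs(prefix: str) -> Dict[str, float]:
    """Hypothetical stand-in for a trained model: favours repeating the motif 'ACG'."""
    motif = "ACG"
    expected = motif[len(prefix) % len(motif)]
    # Probabilities sum to 1: 0.7 for the expected base, 0.1 for each other base.
    return {b: math.log(0.7 if b == expected else 0.1) for b in BASES}


def alternating_beam_search(
    score_fn: Callable[[str], Dict[str, float]],
    length: int,
    wide: int = 8,
    narrow: int = 2,
    period: int = 2,
) -> List[Tuple[str, float]]:
    """Beam search whose width alternates between `wide` and `narrow`.

    The schedule (switch every `period` steps) is an assumption made for
    illustration; the abstract does not specify how the phases alternate.
    """
    beams: List[Tuple[str, float]] = [("", 0.0)]
    for step in range(length):
        # Extend every kept hypothesis by each possible next base.
        candidates = [
            (seq + base, log_p + score)
            for seq, log_p in beams
            for base, score in score_fn(seq).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Expansion phase keeps many hypotheses; contraction phase prunes hard.
        expanding = (step // period) % 2 == 0
        beams = candidates[: wide if expanding else narrow]
    return beams


if __name__ == "__main__":
    for seq, log_p in alternating_beam_search(toy_log_probs, length=6)[:3]:
        print(seq, round(log_p, 3))
```

Keeping the beam wide for some steps reduces the risk of discarding a good hypothesis early, while the narrow steps bound the number of hypotheses held in memory. How the published method balances the two phases is not stated in the abstract.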