Keywords: De novo genome assembly; Repeat elements; Sequence analysis

MeSH: Algorithms; Animals; Arabidopsis / genetics; Base Sequence; Birds / genetics; Consensus Sequence; DNA / chemistry; Drosophila melanogaster / genetics; Humans; Repetitive Sequences, Nucleic Acid; Sequence Alignment; Sequence Analysis, DNA / methods

Source: DOI:10.1186/s12864-018-4920-6

Abstract:
BACKGROUND: Repeat elements are important components of most eukaryotic genomes. Most existing tools for repeat analysis rely on either high-quality reference genomes or existing repeat libraries. Repeat analysis therefore remains challenging for species with highly repetitive or complex genomes, which often lack good reference genomes or annotated repeat libraries. We recently developed a computational method called REPdenovo that constructs consensus repeat sequences directly from short sequence reads and outperforms an existing tool called RepARK. One major issue with REPdenovo is that it does not perform well for repeats with relatively high divergence rates or low copy numbers. In this paper, we present an improved approach for constructing consensus repeats directly from short reads. Compared with the original REPdenovo, the improved approach uses more repeat-related k-mers and improves repeat assembly quality using a consensus-based k-mer processing method.
RESULTS: We compare the performance of the new method with REPdenovo and RepARK on human, Arabidopsis thaliana and Drosophila melanogaster short-read sequencing data. The new method fully constructs more of the repeats in Repbase than the original REPdenovo and RepARK, especially repeats with higher divergence rates and lower copy numbers. We also apply the new method to hummingbird data, for which no repeat library is known, and it constructs many repeat elements that can be validated using PacBio long reads.
CONCLUSIONS: We propose an improved method for reconstructing repeat elements directly from short sequence reads. The results show that our new method can assemble more complete repeats than REPdenovo (and also RepARK). Our new approach has been implemented as part of the REPdenovo software package, which is available for download at https://github.com/Reedwarbler/REPdenovo .
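
The abstract refers to "repeat-related k-mers" without giving details. As a rough illustration only, the sketch below counts canonical k-mers across a set of short reads and keeps those whose count is well above the average, on the assumption that sequence present in many copies yields over-represented k-mers. The function names, the `min_ratio` parameter, and the mean-based threshold are hypothetical choices made for this example; they are not taken from REPdenovo, and the consensus-based k-mer processing step mentioned above is not modeled here.

```python
from collections import Counter

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def canonical(kmer):
    """Pick the lexicographically smaller of a k-mer and its reverse complement,
    so that both strands of the same sequence are counted together."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers with non-ACGT symbols."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def frequent_kmers(counts, min_ratio):
    """Keep k-mers whose count is at least min_ratio times the mean count.

    The mean-based cutoff is an assumption for illustration, not the
    threshold used by any particular tool.
    """
    if not counts:
        return {}
    mean = sum(counts.values()) / len(counts)
    return {kmer: c for kmer, c in counts.items() if c >= min_ratio * mean}


if __name__ == "__main__":
    # Toy input: one read sequence seen five times plus one unique read.
    reads = ["ACGTACGT"] * 5 + ["GATTACAG"]
    counts = count_kmers(reads, k=4)
    print(frequent_kmers(counts, min_ratio=2.0))
```

In a sketch like this, lowering `min_ratio` admits more k-mers, which loosely mirrors the trade-off the abstract describes: a stricter cutoff tends to miss low-copy or divergent repeats, while a looser one admits more non-repeat sequence.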