The development of sequencing technology has increased the number of genomes being sequenced. However, obtaining a high-quality genome sequence remains a challenge, because genome assembly must piece together a massive number of short strings (reads) in the presence of repetitive sequences (repeats). Computer algorithms for genome assembly construct the entire genome from reads using one of two approaches. The de novo approach concatenates reads based on exact matches between the suffix of one read and the prefix of another (overlapping). The reference-guided approach orders reads according to their offsets in a well-known reference genome (read alignment). Repeats add ambiguity: the algorithm cannot distinguish reads that originate from different copies of a repeat, which leads to misassembly and reduces assembly accuracy. In addition, the massive number of reads makes assembly performance a major challenge.
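To make the two approaches concrete, the following is a minimal sketch, not a production assembler. The function names `overlap` and `align_offsets`, the `min_len` threshold, and the toy sequences are illustrative assumptions. It shows an exact suffix-prefix overlap, and how a read drawn from a repeat maps to more than one offset in a reference.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` exists
    (`min_len` is an assumed threshold, not taken from the text).
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def align_offsets(read: str, reference: str) -> list[int]:
    """All offsets at which `read` matches `reference` exactly."""
    offsets = []
    start = reference.find(read)
    while start != -1:
        offsets.append(start)
        start = reference.find(read, start + 1)
    return offsets


# Toy reference containing the repeat "GATTACA" twice (hypothetical data).
reference = "GATTACAGGATTACATT"

print(overlap("ACGTTA", "TTACGG"))         # de novo: suffix "TTA" == prefix "TTA"
print(align_offsets("CAGG", reference))    # unique read: a single offset
print(align_offsets("ATTACA", reference))  # repeat read: two candidate offsets
```

The read taken from the repeat has two equally valid placements, which is the ambiguity that leads to misassembly.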
Repeat identification methods were introduced to address misassembly: by identifying repetitive sequences beforehand, they create a repeat knowledge base that reduces ambiguity during assembly and thus improves the accuracy of the assembled genome. In addition, hybridizing the assembly approaches, with the aid of a reference genome, has resulted in a lower degree of misassembly. Assembly performance is optimized through data structure indexing and parallelization. The primary aim and contribution of this article is an extensive review that eases researchers' search for genome assembly studies. The study also highlights the most recent developments and limitations in genome assembly accuracy and performance optimization.
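One simple way to picture a repeat knowledge base is a table of k-mers that occur more than once. The sketch below rests on that assumption; it is not a method named in the abstract. The names `repeat_kmers` and `is_ambiguous`, the fixed `k`, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter


def repeat_kmers(sequence: str, k: int, min_count: int = 2) -> dict[str, int]:
    """Map each k-mer seen at least `min_count` times to its count.

    A fixed k is an assumption of this sketch; it only captures
    repeats of at least that length.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def is_ambiguous(read: str, repeats: dict[str, int], k: int) -> bool:
    """True when every k-mer of the read is in the repeat knowledge base."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    return bool(kmers) and all(kmer in repeats for kmer in kmers)


# Toy sequence containing the repeat "GATTACA" twice (hypothetical data).
sequence = "GATTACAGGATTACATT"
knowledge_base = repeat_kmers(sequence, k=4)

print(sorted(knowledge_base))                     # k-mers flagged as repetitive
print(is_ambiguous("ATTACA", knowledge_base, 4))  # read lies inside the repeat
print(is_ambiguous("ACAGG", knowledge_base, 4))   # read spans unique sequence
```

An assembler could consult such a table before placing a read and defer or down-weight the reads it flags as ambiguous.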
Our findings show the limitations of the available repeat identification methods, which can detect only repeats of specific lengths and may not perform well when various types of repeats are present in a genome. We also found that most hybrid assembly approaches, whether they start with de novo or reference-guided assembly, have limitations in handling repetitive sequences, because doing so is computationally costly and time intensive. Although the hybrid approach was found to outperform the individual assembly approaches, optimizing its performance remains a challenge. Moreover, parallelization of overlapping and read alignment has yet to be fully implemented in the hybrid assembly approach.
We suggest combining multiple repeat identification methods to identify repeats more accurately as an initial step of the hybrid assembly approach, and combining genome indexing with parallelization to better optimize its performance.
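As a rough illustration of combining genome indexing with parallelization, the sketch below builds a k-mer index of a reference once and then aligns reads concurrently against it. Everything here is an assumption for illustration: the hash-based index, the seed-and-verify exact matching, the thread pool, and the names `build_index`, `align`, and `align_all`. Real tools typically use more compact indexes and process-level or native parallelism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the reference to the offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return dict(index)


def align(read: str, reference: str, index: dict[str, list[int]], k: int) -> list[int]:
    """Seed with the read's first k-mer, then verify the full read at each hit."""
    return [pos for pos in index.get(read[:k], [])
            if reference.startswith(read, pos)]


def align_all(reads: list[str], reference: str, k: int, workers: int = 4) -> list[list[int]]:
    """Align all reads concurrently; results keep the order of `reads`."""
    index = build_index(reference, k)  # built once, shared read-only
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: align(r, reference, index, k), reads))


# Toy data (hypothetical): one repeat read, one unique read, one unmapped read.
reference = "GATTACAGGATTACATT"
print(align_all(["ATTACA", "CAGG", "TTTT"], reference, k=3))
```

The index replaces a full scan of the reference per read with a dictionary lookup, and the reads are independent of each other, so the work divides naturally among workers. In CPython, threads mainly show the structure; a process pool or native code would be needed for CPU-bound speedups.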