Keywords: Alignment-free, Missing regions, Phylogeny, k-mer

Source: DOI:10.1016/j.heliyon.2024.e32227

Abstract:
Conventional approaches to phylogenetic tree estimation usually require pairwise or multiple sequence alignment. However, sequence alignment scales poorly and loses accuracy for long sequences such as whole genomes, for sequences with low identity, and in the presence of genomic rearrangements. Alignment-free approaches have been proposed to address these issues. While these methods have shown promising results, many of them produce errors when regions are missing from the sequences of one or more species, a situation that alignment-based methods detect trivially. Here, we present an alignment-free method for detecting missing regions in the sequences of the species whose phylogeny is to be estimated. It is based on k-mer counts and can be used to filter out k-mers that belong to regions present in one species but missing in one or more of the others. In experiments with real and simulated datasets containing missing regions, the method successfully detects a large fraction of such k-mers and can improve the estimated phylogenies. It can be used within k-mer based alignment-free phylogeny estimation methods to filter out k-mers corresponding to missing regions.
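The idea of filtering k-mers by their counts across species can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not describe in detail. It is a naive presence/absence filter, written under the assumption that a k-mer absent from any species is a candidate for a missing region. The function names `kmer_counts` and `split_shared_kmers` and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def split_shared_kmers(sequences, k):
    """Split each species' k-mers into shared and flagged sets.

    Assumption (not from the paper): a k-mer is flagged when at least
    one other species has a count of zero for it.
    """
    counts = {name: kmer_counts(seq, k) for name, seq in sequences.items()}
    # k-mers observed at least once in every species
    shared = set.intersection(*(set(c) for c in counts.values()))
    kept = {
        name: {kmer: n for kmer, n in c.items() if kmer in shared}
        for name, c in counts.items()
    }
    flagged = {name: set(c) - shared for name, c in counts.items()}
    return kept, flagged


# Toy input: species B lacks the trailing region of species A.
toy = {"A": "ACGTACGGTTCA", "B": "ACGTACGG"}
kept, flagged = split_shared_kmers(toy, k=4)
print(sorted(flagged["A"]))
print(sorted(flagged["B"]))
```

A plain presence/absence test like this also flags k-mers that differ only through ordinary substitutions, so it cannot by itself separate missing regions from sequence divergence. It is meant only to show where such a filter would sit ahead of a k-mer based distance computation.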