alignment-free

  • Article type: Journal Article
    BACKGROUND: Metagenomic binning, the clustering of assembled contigs that belong to the same genome, is a crucial step for recovering metagenome-assembled genomes (MAGs). Contigs are linked by exploiting consistent signatures along a genome, such as read coverage patterns. Using coverage from multiple samples leads to higher-quality MAGs; however, standard pipelines require all-to-all read alignments for multiple samples to compute coverage, becoming a key computational bottleneck.
    RESULTS: We present fairy (https://github.com/bluenote-1577/fairy), an approximate coverage calculation method for metagenomic binning. Fairy is a fast k-mer-based alignment-free method. For multi-sample binning, fairy can be >250× faster than read alignment and accurate enough for binning. Fairy is compatible with several existing binners on host and non-host-associated datasets. Using MetaBAT2, fairy recovers 98.5% of MAGs with >50% completeness and <5% contamination relative to alignment with BWA. Notably, multi-sample binning with fairy is always better than single-sample binning using BWA (>1.5× more >50% complete MAGs on average) while still being faster. For a public sediment metagenome project, we demonstrate that multi-sample binning recovers higher quality Asgard archaea MAGs than single-sample binning and that fairy's results are indistinguishable from read alignment.
    CONCLUSIONS: Fairy is a new tool for approximately and quickly calculating multi-sample coverage for binning, resolving a computational bottleneck for metagenomics.
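The idea of estimating contig coverage from k-mers rather than from read alignments can be sketched as follows. This is a minimal illustration, not fairy's actual algorithm, which the abstract does not describe: the function names, the use of exact counts for every k-mer, the forward-strand-only handling, and the choice of the median are all assumptions made for brevity.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_read_kmers(reads, k):
    """Count how often each k-mer occurs across all reads of one sample."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def approx_coverage(contig, read_counts, k):
    """Estimate coverage as the median read multiplicity of the contig's k-mers.

    This is k-mer coverage, which is somewhat lower than base-level read
    coverage because a read of length L contributes only L - k + 1 k-mers.
    """
    hits = [read_counts.get(km, 0) for km in kmers(contig, k)]
    return median(hits) if hits else 0.0
```

For multi-sample binning, one would call `count_read_kmers` once per sample and `approx_coverage` once per contig and sample, giving the contig-by-sample coverage matrix that binners take as input.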

  • Article type: Journal Article
    DNA sequences of nearly any desired composition, length, and function can be synthesized to alter the biology of an organism for purposes ranging from the bioproduction of therapeutic compounds to invasive pest control. Yet despite offering many great benefits, engineered DNA poses a risk due to its possible misuse or abuse by malicious actors, or its unintentional introduction into the environment. Monitoring the presence of engineered DNA in biological or environmental systems is therefore crucial for routine and timely detection of emerging biological threats, and for improving public acceptance of genetic technologies. To address this, we developed Synsor, a tool for identifying engineered DNA sequences in high-throughput sequencing data. Synsor leverages the k-mer signature differences between naturally occurring and engineered DNA sequences and uses an artificial neural network to classify whether a DNA sequence is natural or engineered. By querying suspected sequences against the model, Synsor can identify sequences that are likely to have been engineered. Using natural plasmid and engineered vector sequences, we showed that Synsor identifies engineered DNA with >99% accuracy. We demonstrate how Synsor can be used to detect potential genetically engineered organisms and to locate where engineered DNA is being introduced into the environment by analysing genomic and metagenomic data from yeast and wastewater samples, respectively. Synsor is therefore a powerful tool that will streamline the process of identifying engineered DNA in poorly characterized biological or environmental systems, thereby allowing for enhanced monitoring of emerging biological threats.

  • Article type: Journal Article
    Phylogenetic tree estimation using conventional approaches usually requires pairwise or multiple sequence alignment. However, sequence alignment has difficulties related to scalability and accuracy in the case of long sequences such as whole genomes, low sequence identity, and the presence of genomic rearrangements. To address these issues, alignment-free approaches have been proposed. While these methods have demonstrated promising results, many of them produce errors when regions are missing from the sequences of one or more species, a situation that alignment-based methods detect trivially. Here, we present an alignment-free method for detecting missing regions in sequences of species for which phylogeny is to be estimated. It is based on counts of k-mers and can be used to filter out k-mers belonging to regions in one species that are missing in one or more of the other species. We perform experiments with real and simulated datasets containing missing regions and find that it can successfully detect a large fraction of such k-mers and can lead to improvements in the estimated phylogenies. Our method can be used in k-mer-based alignment-free phylogeny estimation methods to filter out k-mers corresponding to missing regions.
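The filtering step described above can be illustrated with a deliberately simplified presence/absence rule. The abstract does not give the authors' actual count-based criterion, so everything here is an assumption for illustration: the function names, the `min_fraction` parameter, and the rule of keeping only k-mers seen in enough species. Note that with real data, point mutations also make k-mers species-specific, which is why a practical method needs more than exact matching.

```python
from collections import Counter


def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(genomes, k, min_fraction=1.0):
    """Keep k-mers present in at least min_fraction of the species.

    genomes maps a species name to its sequence. K-mers that fall below
    the threshold are treated as belonging to regions missing elsewhere
    and are dropped before any distance computation.
    """
    presence = Counter()
    for seq in genomes.values():
        presence.update(kmer_set(seq, k))  # each species counted once per k-mer
    needed = min_fraction * len(genomes)
    return {km for km, n in presence.items() if n >= needed}
```

A k-mer-based distance would then be computed only over the returned set, so that a region deleted in one species does not inflate its distance to all the others.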

  • Article type: Journal Article
    BACKGROUND: Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g., a mix of Oxford Nanopore Technologies, Pacific Biosciences, and Illumina data) remains a common issue plaguing large-scale studies. At present, all sample-swap detection methods require costly and unnecessary (e.g., if data are only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare samples. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.
    RESULTS: The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per-sample error rate and coverage bias (i.e., missing sites) can also be estimated with this information, which can be used to determine whether a spatially indexed principal component analysis (PCA)-based prescreening method can be applied; such prescreening can greatly speed up analysis by avoiding exhaustive all-to-all comparisons.
    CONCLUSIONS: Because this tool processes raw data, is faster than alignment, and can be used on very low-coverage data, it can save a large amount of computational resources in standard quality control (QC) pipelines. It is robust enough to be used on different sequencing data types, which is important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample swap detection, this method also provides information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.

  • Article type: Journal Article
    Chromosomal fusion is a significant form of structural variation, but research into algorithms for its identification has been limited. Most existing methods rely on synteny analysis, which necessitates manual annotations and always involves inefficient sequence alignments. In this paper, we present a novel alignment-free algorithm for chromosomal fusion recognition. Our method transforms the problem into a series of assignment problems using natural vectors and efficiently solves them with the Kuhn-Munkres algorithm. When applied to the human/gorilla and swamp buffalo/river buffalo datasets, our algorithm successfully and efficiently identifies chromosomal fusion events. Notably, our approach offers several advantages, including higher processing speeds by eliminating time-consuming alignments and removing the need for manual annotations. From an alignment-free perspective, our algorithm initially considers entire chromosomes instead of fragments to identify chromosomal structural variations, offering substantial potential to advance research in this field.
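The two ingredients named above, natural vectors and an assignment problem, can be sketched in a few lines. The natural vector below follows the commonly used 12-dimensional definition for DNA (per-nucleotide count, mean position, and normalized second central moment). How the paper builds its series of assignment problems is not stated in the abstract, so the one-to-one chromosome matching shown here is an assumption, and a brute-force search over permutations stands in for the Kuhn-Munkres algorithm: it returns the same optimum but is only feasible for a handful of sequences.

```python
from itertools import permutations
from math import dist


def natural_vector(seq):
    """12-dimensional natural vector: counts, mean positions, second moments."""
    n = len(seq)
    counts, means, moments = [], [], []
    for base in "ACGT":
        pos = [i + 1 for i, c in enumerate(seq) if c == base]  # 1-based positions
        n_k = len(pos)
        mu = sum(pos) / n_k if n_k else 0.0
        d2 = sum((p - mu) ** 2 for p in pos) / (n_k * n) if n_k else 0.0
        counts.append(float(n_k))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


def best_assignment(seqs_a, seqs_b):
    """Match each sequence in seqs_a to one in seqs_b at minimum total distance.

    Brute force over permutations (a stand-in for Kuhn-Munkres); both lists
    must have the same length. Returns, for each index i in seqs_a, the
    index of its partner in seqs_b.
    """
    va = [natural_vector(s) for s in seqs_a]
    vb = [natural_vector(s) for s in seqs_b]
    best = min(
        permutations(range(len(vb))),
        key=lambda p: sum(dist(va[i], vb[j]) for i, j in enumerate(p)),
    )
    return list(best)
```

With real chromosomes the count and position components differ greatly in scale, so some normalization of the vectors would probably be needed before taking Euclidean distances.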

  • Article type: Journal Article
    We review current methods and bioinformatics tools for text complexity estimation (information and entropy measures). The search for DNA regions with extreme statistical characteristics, such as low-complexity regions, is important for biophysical models of chromosome function and gene transcription regulation at genome scale. We discuss complexity profiling for segmentation and delineation of genome sequences, the search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review the complexity methods and new application fields: analysis of mutation hotspot loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. Algorithms implementing various numerical measures of text complexity, including combinatorial and linguistic measures, were developed before the genome sequencing era. A series of tools for estimating sequence complexity use compression approaches, mainly modifications of Lempel-Ziv compression. Most of the tools are available online, providing large-scale services for whole-genome analysis. Novel machine learning applications for classification of complete genome sequences also include sequence compression and complexity algorithms. We present a comparison of the complexity methods on different sequence sets and their applications to the analysis of gene transcription regulatory regions. Furthermore, we discuss approaches to and applications of sequence complexity for proteins. Complexity measures for amino acid sequences can be calculated by the same entropy- and compression-based algorithms, but the functional and evolutionary roles of low-complexity regions in proteins have specific features that differ from those in DNA. The tools for protein sequence complexity are aimed at protein structural constraints. It has been shown that low-complexity regions in protein sequences are conserved in evolution and have important biological and structural functions. Finally, we summarize recent findings in large-scale genome complexity comparison and applications to coronavirus genome analysis.
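The two families of complexity measures mentioned above, entropy-based and compression-based, can each be illustrated in a few lines. This is a sketch rather than any of the reviewed tools: the function names are hypothetical, and zlib's DEFLATE (which builds on LZ77) is used as a convenient stand-in for the Lempel-Ziv modifications the review refers to.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(seq, k=1):
    """Shannon entropy (bits) of the k-mer distribution of seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def compression_ratio(seq):
    """Compressed size divided by raw size; lower means lower complexity."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)
```

Applied in a sliding window along a genome, either score gives a complexity profile in which homopolymer runs and short tandem repeats show up as dips. For very short windows the fixed overhead of the zlib container dominates `compression_ratio`, so the entropy measure is the more reliable of the two there.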

  • Article type: Journal Article
    BACKGROUND: The delimitation of species of Tulasnella has been extensively studied, mainly at the morphological (sexual and asexual states) and molecular levels, showing ambiguity between them. An integrative species concept that includes molecular, ecological, and morphological characteristics, as well as other information, is crucial for species delimitation in complex groups such as Tulasnella.
    OBJECTIVE: The aim of this study is to test evolutionary relationships using a combination of alignment-based and alignment-free distance matrices as an alternative molecular tool to traditional methods, and to consider the secondary structures and CBCs from ITS2 (internal transcribed spacer) sequences for species delimitation in Tulasnella.
    METHODS: Three phylogenetic approaches were plotted: (i) alignment-based, (ii) alignment-free, and (iii) a combination of both distance matrices using the DISTATIS and pvclust libraries from an R package. Finally, the secondary structure consensus was modeled by Mfold, and a CBC analysis was obtained to complement the species delimitation using 4Sale.
    CONCLUSIONS: The phylogenetic tree results showed delimited monophyletic clades in Tulasnella spp., where all 142 Tulasnella sequences were divided into two main clades A and B and assigned to seven species (T. asymmetrica, T. andina, T. eichleriana ECU6, T. eichleriana ECU4, T. pinicola, T. violea), supported by bootstrap values from 72% to 100%. From the 2D secondary structure alignment, three types of consensus models with helices and loops were obtained. Thus, T. albida belongs to type I; T. eichleriana, T. tomaculum, and T. violea belong to type II; and T. asymmetrica, T. andina, T. pinicola, and T. spp. (GER) belong to type III; each type contains four to six domains, with nine CBCs among these that corroborate different species.

  • Article type: Journal Article
    Alignment-based RNA-seq quantification methods typically involve a time-consuming alignment process prior to estimating transcript abundances. In contrast, alignment-free RNA-seq quantification methods bypass this step, resulting in significant speed improvements. Existing alignment-free methods rely on the Expectation-Maximization (EM) algorithm for estimating transcript abundances. However, EM algorithms only guarantee locally optimal solutions, leaving room for further accuracy improvement by finding a globally optimal solution. In this study, we present TQSLE, the first alignment-free RNA-seq quantification method that provides a globally optimal solution for transcript abundance estimation. TQSLE adopts a two-step approach: first, it constructs a k-mer frequency matrix $A$ for the reference transcriptome and a k-mer frequency vector $b$ for the RNA-seq reads; then, it directly estimates transcript abundances by solving the linear equation $A^{T}Ax = A^{T}b$. We evaluated the performance of TQSLE using simulated and real RNA-seq data sets and observed that, despite comparable speed to other alignment-free methods, TQSLE outperforms them in terms of accuracy. TQSLE is freely available at https://github.com/yhg926/TQSLE.
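The two-step approach can be made concrete with a toy example: build the k-mer frequency matrix A (one column per transcript) and the read k-mer vector b, then solve the normal equations of the least-squares problem. This is a sketch under stated assumptions, not the TQSLE implementation: the function names are hypothetical, exact counts of all k-mers are used, the small dense system is solved with plain Gaussian elimination, and no non-negativity constraint or normalization is applied.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def solve(M, y):
    """Solve M x = y by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: transcripts are not distinguishable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def estimate_abundances(transcripts, reads, k):
    """Least-squares abundances x from the normal equations A^T A x = A^T b."""
    cols = [kmer_counts(t, k) for t in transcripts]
    b = Counter()
    for r in reads:
        b.update(kmer_counts(r, k))
    rows = sorted(set().union(*cols))  # every k-mer seen in the reference
    A = [[float(col[km]) for col in cols] for km in rows]
    bv = [float(b[km]) for km in rows]
    m = len(cols)
    AtA = [[sum(A[r][i] * A[r][j] for r in range(len(rows))) for j in range(m)]
           for i in range(m)]
    Atb = [sum(A[r][i] * bv[r] for r in range(len(rows))) for i in range(m)]
    return solve(AtA, Atb)
```

Because the least-squares objective is convex, any solution of the normal equations is a global minimizer, which is the contrast with EM drawn above. The system is uniquely solvable only when the columns of A are linearly independent, that is, when no transcript's k-mer profile is a combination of the others.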

  • Article type: Journal Article
    BACKGROUND: Molecular phylogenetics studies the evolutionary relationships among the individuals of a population through their biological sequences. It may provide insights about the origin and the evolution of viral diseases, or highlight complex evolutionary trajectories. A key task is inferring phylogenetic trees from any type of sequencing data, including raw short reads. Yet, several tools require pre-processed input data, e.g., from complex computational pipelines based on de novo assembly or from mappings against a reference genome. As sequencing technologies keep becoming cheaper, this puts increasing pressure on designing methods that perform analysis directly on their outputs. From this viewpoint, there is a growing interest in alignment-, assembly-, and reference-free methods that can work on several kinds of data, including raw reads.
    RESULTS: We present phyBWT2, a newly improved version of phyBWT (Guerrini et al. in 22nd International Workshop on Algorithms in Bioinformatics (WABI) 242:23-12319, 2022). Both of them directly reconstruct phylogenetic trees, bypassing both the alignment against a reference genome and de novo assembly. They exploit the combinatorial properties of the extended Burrows-Wheeler Transform (eBWT) and the corresponding eBWT positional clustering framework to detect relevant blocks of the longest shared substrings of varying length (unlike the k-mer-based approaches that need to fix the length k a priori). As a result, they provide novel alignment-, assembly-, and reference-free methods that build partition trees without relying on the pairwise comparison of sequences, thus avoiding the use of a distance matrix to infer phylogeny. In addition, phyBWT2 outperforms phyBWT in terms of running time, as the former reconstructs phylogenetic trees step-by-step by considering multiple partitions, instead of just one partition at a time, as previously done by the latter.
    CONCLUSIONS: Based on the results of the experiments on sequencing data, we conclude that our method can produce trees of quality comparable to the benchmark phylogeny by handling datasets of different types (short reads, contigs, or entire genomes). Overall, the experiments confirm the effectiveness of phyBWT2, which improves on the performance of its previous version phyBWT while preserving the accuracy of the results.

  • Article type: Journal Article
    In the age of genome sequencing, whole-genome data is readily and frequently generated, leading to a wealth of new information that can be used to advance various fields of research. New approaches, such as alignment-free phylogenetic methods that utilize k-mer-based distance scoring, are becoming increasingly popular given their ability to rapidly generate phylogenetic information from whole-genome data. However, these methods have not yet been tested using environmental data, which often tends to be highly fragmented and incomplete. Here, we compare the results of one alignment-free approach (which utilizes the D2 statistic) to traditional multi-gene maximum likelihood trees in 3 algal groups that have high-quality genome data available. In addition, we simulate lower-quality, fragmented genome data using these algae to test method robustness to genome quality and completeness. Finally, we apply the alignment-free approach to environmental metagenome-assembled genome data of unclassified Saccharibacteria and Trebouxiophyte algae, and single-cell amplified data from uncultured marine stramenopiles to demonstrate its utility with real datasets. We find that in all instances, the alignment-free method produces phylogenies that are comparable to, and often more informative than, those created using the traditional multi-gene approach. The k-mer-based method performs well even when there are significant missing data that include marker genes traditionally used for tree reconstruction. Our results demonstrate the value of alignment-free approaches for classifying novel, often cryptic or rare, species that may not be culturable or are difficult to access using single-cell methods, but fill important gaps in the tree of life.
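The D2 statistic referred to above is, in its basic form, the inner product of the k-mer count vectors of two sequences. The sketch below computes that raw D2 and turns it into a dissimilarity for tree building. The cosine-style normalization is an illustrative assumption, not necessarily the variant used in the study, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every k-mer of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y, k):
    """Raw D2: sum over all k-mers w of count_x(w) * count_y(w)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[w] for w, n in cx.items())


def d2_dissimilarity(x, y, k):
    """1 - D2(x, y) / sqrt(D2(x, x) * D2(y, y)), a value in [0, 1].

    The normalization is an assumption made for this example; 0 means
    identical k-mer profiles and 1 means no shared k-mers.
    """
    denom = sqrt(d2(x, x, k) * d2(y, y, k))
    return 1.0 - d2(x, y, k) / denom if denom else 1.0
```

Computing `d2_dissimilarity` for every pair of genomes yields a distance matrix that a standard method such as neighbor joining can turn into a tree. Because the counts are pooled over the whole input, the same code applies to fragmented assemblies if the k-mer counts of all contigs are summed first.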