BACKGROUND: Although the total number of microbial taxa on Earth is still under debate, it is clear that only a small fraction of them have been cultivated and validly named. The inability to culture most bacteria outside of very specific conditions severely limits their characterization and further study. In the last decade, a major part of the solution to this problem has been metagenome sequencing, in which the DNA of an entire microbial community is sequenced and the genomes of its novel component species are then reconstructed in silico. Because of the large discrepancy between the number of sequenced type strain genomes (around 12,000) and total microbial diversity (10^6 to 10^12 species), these efforts must rely on de novo assembly and binning. Unfortunately, both steps are error-prone, so the results have to be closely scrutinized to avoid publishing incomplete and low-quality genomes.
RESULTS: We developed MAGISTA (metagenome-assembled genome intra-bin statistics assessment), a novel approach to assessing the quality of metagenome-assembled genomes that tackles some of the often-neglected drawbacks of current reference gene-based methods. MAGISTA is based on alignment-free distance distributions between contig fragments within metagenomic bins, rather than on a set of reference genes. Proper training required a highly complex genomic DNA mock community, which we constructed by pooling genomic DNA from 227 bacterial strains, specifically selected to represent the major phylogenetic lineages of cultivable bacteria.
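To make the idea of an intra-bin, alignment-free distance distribution concrete, the following is a minimal sketch and not MAGISTA's actual implementation. It assumes tetranucleotide frequency profiles, non-overlapping 1,000 bp fragments, and Euclidean distance. These choices, and all function names, are illustrative assumptions that do not come from the abstract.

```python
import math
from itertools import combinations, product

K = 4  # assumption: tetranucleotide profiles; the abstract does not name a k-mer size
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def fragment(seq, size):
    """Split a contig into non-overlapping fragments, dropping a short tail."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def kmer_profile(seq):
    """Return normalized k-mer frequencies, skipping k-mers with ambiguous bases."""
    counts = [0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(KMERS)


def euclidean(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def intra_bin_distances(contigs, size=1000):
    """Pairwise distances between all fragment profiles within one bin."""
    fragments = [f for c in contigs for f in fragment(c.upper(), size)]
    profiles = [kmer_profile(f) for f in fragments]
    return [euclidean(p, q) for p, q in combinations(profiles, 2)]


def summarize(distances):
    """Simple summary statistics of a distance distribution."""
    if not distances:
        raise ValueError("need at least two fragments to form a distance")
    n = len(distances)
    mean = sum(distances) / n
    variance = sum((d - mean) ** 2 for d in distances) / n
    return {"n": n, "mean": mean, "std": math.sqrt(variance)}
```

In a sketch like this, a bin whose fragments all come from one genome would be expected to give a tight distribution of small distances, whereas a contaminated bin would tend to give a broader or multimodal one. Summary statistics of that distribution could then serve as features for a trained quality predictor.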
CONCLUSIONS: When tested on publicly available mock metagenomes, MAGISTA achieved a 20% reduction in root-mean-square error compared with the marker gene approach. Furthermore, our highly complex genomic DNA mock community is a valuable tool for benchmarking new and existing metagenome analysis methods.
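For reference, the root-mean-square error and a relative reduction between two methods can be computed as in the short sketch below. The function names are hypothetical, and the sketch assumes that predicted and true quality values (for example, completeness) are available per bin. It does not reproduce any figures from the study.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and true values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_rmse, new_rmse):
    """Fractional RMSE reduction of a new method relative to a baseline."""
    return (baseline_rmse - new_rmse) / baseline_rmse
```

A relative reduction of 0.20 corresponds to the 20% improvement reported above.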