Keywords: Accessory genome; Comparative genomics; Core gene; Genome plasticity; Homology; MAG; Orthology; Pangenome

MeSH: Phylogeny; Reproducibility of Results; Uncertainty; Genome Size; Cluster Analysis

Source: DOI:10.1186/s13059-023-03089-3

Abstract:
Background: A key step for comparative genomics is to group open reading frames into functionally and evolutionarily meaningful gene clusters. Gene clustering is complicated by intraspecific duplications and horizontal gene transfers that are frequent in prokaryotes. In consequence, gene clustering methods must deal with a trade-off between identifying vertically transmitted representatives of multicopy gene families, which are recognizable by synteny conservation, and retrieving complete sets of species-level orthologs. We studied the implications of adopting homology, orthology, or synteny conservation as formal criteria for gene clustering by performing comparative analyses of 125 prokaryotic pangenomes.
Results: Clustering criteria affect pangenome functional characterization, core genome inference, and reconstruction of ancestral gene content to different extents. Species-wise estimates of pangenome and core genome sizes change by the same factor when using different clustering criteria, allowing robust cross-species comparisons regardless of the clustering criterion. However, cross-species comparisons of genome plasticity and functional profiles are substantially affected by inconsistencies among clustering criteria. Such inconsistencies are driven not only by mobile genetic elements, but also by genes involved in defense, secondary metabolism, and other accessory functions. In some pangenome features, the variability attributed to methodological inconsistencies can even exceed the effect sizes of ecological and phylogenetic variables.
Conclusions: Choosing an appropriate criterion for gene clustering is critical to conduct unbiased pangenome analyses. We provide practical guidelines to choose the right method depending on the research goals and the quality of genome assemblies, and a benchmarking dataset to assess the robustness and reproducibility of future comparative studies.
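As a rough illustration of why the clustering criterion matters for pangenome and core genome size estimates, the following minimal sketch counts both quantities from a table of gene-to-cluster assignments. It is not the method used in the study: the function name `pangenome_stats`, the `(genome_id, gene_id, cluster_id)` tuple layout, the `core_fraction` parameter, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def pangenome_stats(assignments, core_fraction=1.0):
    """Count pangenome and core genome sizes (hypothetical helper).

    assignments: iterable of (genome_id, gene_id, cluster_id) tuples.
    core_fraction: minimum fraction of genomes in which a cluster must
    occur to be counted as core (1.0 means present in every genome).
    """
    genomes = set()
    cluster_genomes = defaultdict(set)
    for genome_id, _gene_id, cluster_id in assignments:
        genomes.add(genome_id)
        cluster_genomes[cluster_id].add(genome_id)

    threshold = core_fraction * len(genomes)
    core = sum(
        1 for present in cluster_genomes.values() if len(present) >= threshold
    )
    return {"pangenome": len(cluster_genomes), "core": core}


# Toy data (assumed): three genomes A, B, C; genome A carries a paralog (a2).
# Homology-style clustering keeps the paralog in the same family F1.
homology = [
    ("A", "a1", "F1"), ("A", "a2", "F1"), ("B", "b1", "F1"), ("C", "c1", "F1"),
    ("A", "a3", "F2"), ("B", "b2", "F2"),
]
# Synteny-style clustering splits the paralog into its own cluster F1b.
synteny = [
    ("A", "a1", "F1a"), ("A", "a2", "F1b"), ("B", "b1", "F1a"), ("C", "c1", "F1a"),
    ("A", "a3", "F2"), ("B", "b2", "F2"),
]

homology_stats = pangenome_stats(homology)
synteny_stats = pangenome_stats(synteny)
print(homology_stats, synteny_stats)
```

In this toy case, splitting the paralog into its own cluster increases the pangenome size while the core genome size stays the same, which is one simple way the same genomes can yield different summary statistics under different clustering criteria.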