Keywords: assessment; chromosome number; eukaryote; genome assembly; genome size; genomics; sequencing

MeSH: Genome Size; Chromosomes / genetics; Eukaryota / genetics; Genomics / methods; Algorithms; Sequence Analysis, DNA / methods

Source: DOI:10.1093/genetics/iyae099

Abstract:
The number of genome assemblies has increased rapidly in recent years, with NCBI databases now holding over 41,000 eukaryotic genome assemblies across about 2,300 species. Increases in read length and improvements in assembly algorithms have led to greater contiguity and larger genome assemblies. While this number of assemblies is impressive, only about a third of them have a corresponding genome size estimate for their species in publicly available databases. In this paper, genome assemblies are assessed by comparing their total size to the publicly available genome size estimate for the same species. The resulting deviations in size are examined in relation to genome size, kingdom, sequencing platform, and standard assembly metrics such as N50 and BUSCO values. A large proportion of assemblies deviate from their estimated genome size by more than 10%, and the deviation grows with genome size, suggesting that non-protein-coding and structural DNA may be responsible. Modest differences in the performance of sequencing platforms are noted as well. While assemblies that score well on standard assessment metrics are more likely to approach the estimated genome size, much of the variation in size deviation is not explained by these raw metrics. A new, proportional N50 metric (PN50) is proposed, in which N50 values are expressed relative to the average chromosome size of each species. This new metric has a stronger relationship with complete genome assemblies and, because it is proportional, allows a more direct comparison across assemblies of genomes that vary in size and architecture.
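The two quantities the abstract relies on, the deviation of assembly size from the estimated genome size and the proportional N50, can be illustrated with a minimal Python sketch. The abstract does not give exact formulas, so the definitions here are assumptions: average chromosome size is taken as the estimated genome size divided by the haploid chromosome number, and deviation is expressed as a signed fraction of the estimated genome size. The function names are hypothetical and are not taken from the paper.

```python
def n50(contig_lengths):
    """Standard N50: the length of the contig at which the running total of
    contigs, sorted longest first, reaches at least half the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


def proportional_n50(contig_lengths, genome_size, haploid_chromosome_number):
    """N50 relative to average chromosome size.

    Assumption (not stated in the abstract): average chromosome size is
    genome_size / haploid_chromosome_number.
    """
    average_chromosome_size = genome_size / haploid_chromosome_number
    return n50(contig_lengths) / average_chromosome_size


def size_deviation(assembly_size, estimated_genome_size):
    """Signed deviation of assembly size from the estimate, as a fraction.

    Assumption: the deviation is normalised by the estimated genome size.
    Negative values mean the assembly is smaller than the estimate.
    """
    return (assembly_size - estimated_genome_size) / estimated_genome_size


if __name__ == "__main__":
    # Toy numbers in arbitrary units, chosen only for illustration.
    contigs = [40, 30, 20, 10]
    estimated_size = 120
    chromosomes = 2

    print("N50:", n50(contigs))
    print("Proportional N50:", proportional_n50(contigs, estimated_size, chromosomes))
    deviation = size_deviation(sum(contigs), estimated_size)
    print("Deviates by more than 10%:", abs(deviation) > 0.10)
```

Under these assumptions a proportional N50 near 1 would mean the typical contig is about the size of an average chromosome, which is what makes the value comparable across species whose genomes differ in size and chromosome count.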