Owing to technological advances in ancient DNA, it is now possible to sequence viruses from the past to track down their origin and evolution. However, ancient DNA data is considerably more degraded and contaminated than modern data, making the identification of ancient viral genomes particularly challenging. Several methods to characterise the modern microbiome (and, within this, the virome) have been developed; in particular, tools that assign sequenced reads to specific taxa in order to characterise the organisms present in a sample of interest. While these existing tools are routinely used on modern data, their performance when applied to ancient microbiome data to screen for ancient viruses remains unknown. In this work, we conducted an extensive simulation study using public viral sequences to establish which tool is the most suitable to screen ancient samples for human DNA viruses. We compared the performance of four widely used classifiers, namely Centrifuge, Kraken2, DIAMOND and MetaPhlAn2, in correctly assigning sequencing reads to the corresponding viruses. To do so, we simulated reads by adding noise typical of ancient DNA to a set of publicly available human DNA viral sequences and to the human genome. We fragmented the DNA into different lengths, then added sequencing errors and C-to-T and G-to-A deamination substitutions at the read termini. We then measured the resulting sensitivity and precision for all classifiers. Across most simulations, more than 228 of the 233 simulated viruses were recovered by Centrifuge, Kraken2 and DIAMOND, whereas MetaPhlAn2 recovered only around one third. Overall, Centrifuge and Kraken2 performed best, with the highest values of sensitivity and precision. We found that deamination damage had little impact on the performance of the classifiers, less than the sequencing error and the length of the reads. Since Centrifuge can handle short reads (in contrast to DIAMOND and Kraken2 with default settings) and since it achieved the highest sensitivity and precision at the species level across all the simulations performed, it is our recommended tool. Regardless of the tool used, our simulations indicate that, for ancient human studies, users should apply strict filters to remove all reads of potential human origin. Finally, we recommend that users verify which species are present in the database used, as default databases may lack sequences for viruses of interest.
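
The simulation steps described above (fragmentation, terminal deamination, sequencing error, then scoring the classifier output) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the geometric decay of damage away from the read ends, the uniform substitution error model and the read-level definitions of sensitivity and precision are all assumptions made for illustration.

```python
import random


def fragment(genome, read_len, n_reads, rng):
    """Sample n_reads fixed-length substrings from genome (assumed sampling scheme)."""
    last_start = len(genome) - read_len
    return [genome[s:s + read_len]
            for s in (rng.randrange(last_start + 1) for _ in range(n_reads))]


def deaminate(read, rate, decay, rng):
    """Apply C-to-T near the 5' end and G-to-A near the 3' end.

    Assumption: the substitution probability is rate * decay**d, where d is
    the distance (in bases) from the relevant read terminus.
    """
    bases = list(read)
    n = len(bases)
    for i, base in enumerate(bases):
        if base == "C" and rng.random() < rate * decay ** i:
            bases[i] = "T"
        elif base == "G" and rng.random() < rate * decay ** (n - 1 - i):
            bases[i] = "A"
    return "".join(bases)


def add_sequencing_error(read, error_rate, rng):
    """Replace each base with a different base with probability error_rate."""
    bases = list(read)
    for i, base in enumerate(bases):
        if rng.random() < error_rate:
            bases[i] = rng.choice([b for b in "ACGT" if b != base])
    return "".join(bases)


def sensitivity_precision(true_taxa, assigned_taxa):
    """Score read assignments; None in assigned_taxa means 'unclassified'.

    Assumed definitions: sensitivity = correct reads / all simulated reads,
    precision = correct reads / all classified reads.
    """
    pairs = list(zip(true_taxa, assigned_taxa))
    correct = sum(1 for truth, call in pairs if call == truth)
    classified = sum(1 for _, call in pairs if call is not None)
    sensitivity = correct / len(pairs) if pairs else 0.0
    precision = correct / classified if classified else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy random sequence standing in for a viral genome (hypothetical data).
    genome = "".join(rng.choice("ACGT") for _ in range(5000))
    reads = fragment(genome, read_len=40, n_reads=5, rng=rng)
    damaged = [add_sequencing_error(deaminate(r, 0.3, 0.7, rng), 0.01, rng)
               for r in reads]
    for original, noisy in zip(reads, damaged):
        print(original, noisy, sep="\n", end="\n\n")
```

Reads produced this way would be written to FASTQ and passed to each classifier; the per-read assignments could then be compared against the known source taxon with `sensitivity_precision`. Varying `read_len`, `rate` and `error_rate` independently mirrors the idea of testing how each source of noise affects classification.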