Keywords: bacteria; benchmark; chromosome; genomic features; k-mer; machine learning; plasmid; prediction tool; random forest; shared k-mers

MeSH: Genomics / methods; Plasmids / genetics; Genome, Bacterial; Machine Learning

Source: DOI:10.1128/spectrum.04645-22

Abstract:
Identifying plasmids in bacterial genomes is important for studying horizontal gene transfer, antibiotic resistance genes, and host-microbe interactions, and for applications such as cloning vectors and industrial production. Several in silico methods predict plasmid sequences in assembled genomes. However, existing methods have evident shortcomings, such as an imbalance between sensitivity and specificity, dependence on species-specific models, and reduced performance on sequences shorter than 10 kb, which limit their applicability. In this work, we propose Plasmer, a novel plasmid predictor based on machine learning of shared k-mers and genomic features. Unlike existing k-mer-based or genomic-feature-based methods, Plasmer uses the random forest algorithm to make predictions from the percentage of k-mers shared with plasmid and chromosome databases, combined with other genomic features, including alignment E value and replicon distribution scores (RDS). Plasmer makes predictions across multiple species and achieved an average area under the curve (AUC) of 0.996 with an accuracy of 98.4%. Tests on sliding sequences, simulated assemblies, and de novo assemblies consistently showed that Plasmer is more accurate than existing methods and performs stably across long and short contigs above 500 bp, demonstrating its applicability to fragmented assemblies. Plasmer also shows high and balanced sensitivity and specificity (both >0.95 for contigs above 500 bp) with the highest F1-score, avoiding the bias toward either sensitivity or specificity that was common in existing methods. Plasmer also provides taxonomy classification to help identify the origin of plasmids.

IMPORTANCE In this study, we propose a novel plasmid prediction tool named Plasmer. Technically, unlike existing k-mer-based or genomic-feature-based methods, Plasmer is the first tool to combine the advantages of the percentage of shared k-mers and the alignment scores of genomic features. This gives Plasmer (i) an evident improvement in performance over other methods, with the best F1-score and accuracy on sliding sequences, simulated contigs, and de novo assemblies; (ii) applicability to contigs above 500 bp with the highest accuracy, enabling plasmid prediction in fragmented short-read assemblies; (iii) high and balanced sensitivity and specificity (both >0.95 for contigs above 500 bp) with the highest F1-score, avoiding the bias toward either sensitivity or specificity that commonly existed in other methods; and (iv) no dependence on species-specific training models. We believe that Plasmer provides a more reliable alternative for plasmid prediction in bacterial genome assemblies.
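
To make the "percentage of shared k-mers" feature concrete, the sketch below computes, for a single contig, the percentage of its distinct k-mers that also occur in a plasmid reference set and in a chromosome reference set. This is only an illustration under stated assumptions, not Plasmer's implementation: the abstract does not give the k-mer length, the counting rule, or the database format, so the value `k=4`, the use of distinct k-mers as the denominator, the toy reference strings, and the function names `kmers` and `shared_kmer_percent` are all hypothetical. Reverse complements are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_percent(query, reference_kmers, k):
    """Percentage of the query's distinct k-mers found in a reference k-mer set.

    Assumption: the denominator is the number of distinct k-mers in the query;
    the real tool may define the percentage differently.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        # Query shorter than k: no k-mers, so nothing can be shared.
        return 0.0
    return 100.0 * len(query_kmers & reference_kmers) / len(query_kmers)


# Toy reference "databases" (hypothetical sequences, far smaller than real ones).
K = 4
plasmid_db = kmers("ATGCGTACGTTAGC", K)
chromosome_db = kmers("TTTTGGGGCCCCAAAA", K)

# Two hypothetical contigs and their two-value feature vectors.
contig_a = "ATGCGTACGT"  # every 4-mer occurs in the plasmid reference
contig_b = "ATGCAAAA"    # shares 4-mers with both references

features_a = (
    shared_kmer_percent(contig_a, plasmid_db, K),
    shared_kmer_percent(contig_a, chromosome_db, K),
)
features_b = (
    shared_kmer_percent(contig_b, plasmid_db, K),
    shared_kmer_percent(contig_b, chromosome_db, K),
)

print(features_a)
print(features_b)
```

In a pipeline of the kind the abstract describes, such percentages would be combined with the other genomic features it names (alignment E value and RDS) and passed to a random forest classifier; that step is omitted here because the abstract does not specify how those features are computed.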