BACKGROUND: Machine learning tools based on genomic data are promising for real-time surveillance that performs source attribution of foodborne bacteria such as Listeria monocytogenes. Given the heterogeneity of machine learning practices, our aim was to identify the practices that influence source prediction performance when the usual holdout method is combined with repeated k-fold cross-validation.
METHODS: A large collection of 1,100 L. monocytogenes genomes with known sources was built according to several genomic metrics to ensure the authenticity and completeness of the genomic profiles. Based on these genomic profiles (i.e. 7-locus alleles, core alleles, accessory genes, core SNPs and pan k-mers), we developed a versatile workflow that assesses the prediction performance of different combinations of training dataset splitting (i.e. 50, 60, 70, 80 and 90%), data preprocessing (i.e. with or without near-zero variance removal), and learning models (i.e. BLR, ERT, RF, SGB, SVM and XGB). The performance metrics were accuracy, Cohen's kappa, F1-score, the area under the receiver operating characteristic curve, the precision-recall curve or the precision-recall-gain curve, and execution time.
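The evaluation protocol described above (a holdout split followed by repeated k-fold cross-validation on the training part, scored with a chance-corrected metric) can be sketched as follows. This is a minimal illustration in standard-library Python, not the published workflow: the function names, the stratification by source, the seeds, the fold count and the toy labels are all assumptions introduced here.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout(labels, train_fraction, seed=0):
    """Split sample indices into train/test, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def repeated_kfold(indices, k, repeats, seed=0):
    """Yield (fit, validate) index lists for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)  # a fresh shuffle per repeat gives different folds
        folds = [shuffled[i::k] for i in range(k)]
        for held_out in range(k):
            fit = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield fit, folds[held_out]


def cohen_kappa(truth, predicted):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    truth_counts, pred_counts = Counter(truth), Counter(predicted)
    expected = sum(truth_counts[c] * pred_counts[c] for c in truth_counts) / (n * n)
    if expected == 1:
        return 0.0  # kappa is undefined with a single class; 0.0 is a choice made here
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels standing in for genome sources.
labels = ["dairy"] * 10 + ["meat"] * 10
train, test = stratified_holdout(labels, train_fraction=0.8)
splits = list(repeated_kfold(train, k=5, repeats=2))
print(len(train), len(test), len(splits))
```

In a setup like this, models are tuned on the cross-validation splits drawn from the training part only, and the held-out test part is scored once at the end, so the testing accuracy is not influenced by model selection.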
RESULTS: The average testing accuracies from accessory genes and pan k-mers were significantly higher than those from core alleles or core SNPs. While the accuracies from 70 and 80% training dataset splitting were not significantly different, those from 80% were significantly higher than those from the remaining tested proportions. Near-zero variance removal prevented results from being produced for 7-locus alleles, did not significantly affect accuracy for core alleles, accessory genes and pan k-mers, and significantly decreased accuracy for core SNPs. The SVM and XGB models did not differ significantly from each other in accuracy, and both reached significantly higher accuracies than BLR, SGB, ERT and RF, in that order. However, the SVM model required more computing power than the XGB model, especially for large numbers of descriptors such as core SNPs and pan k-mers.
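For readers unfamiliar with the near-zero variance removal evaluated above, the sketch below shows one common definition of the filter. The thresholds are borrowed from the defaults of `nearZeroVar` in the R package caret (frequency ratio above 95/5 and at most 10% unique values); whether the study used these thresholds is an assumption, and the function names and toy profiles are hypothetical.

```python
from collections import Counter


def is_near_zero_variance(column, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a feature column whose values are almost all identical."""
    counts = Counter(column).most_common(2)
    if len(counts) < 2:
        return True  # constant (or empty) column carries no information
    # Ratio of the most common value's count to the second most common one.
    freq_ratio = counts[0][1] / counts[1][1]
    # Distinct values as a percentage of the number of samples.
    percent_unique = 100.0 * len(set(column)) / len(column)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


def drop_near_zero_variance(profiles):
    """Keep only the features (name -> list of values) that pass the filter."""
    return {
        name: column
        for name, column in profiles.items()
        if not is_near_zero_variance(column)
    }


# Hypothetical presence/absence profiles for 100 genomes.
profiles = {
    "gene_rare": [0] * 99 + [1],       # nearly constant, so it is dropped
    "gene_common": [0] * 50 + [1] * 50,  # informative, so it is kept
}
print(sorted(drop_near_zero_variance(profiles)))
```

A filter of this kind removes descriptors regardless of the source labels, so its effect on accuracy depends on how much signal sits in rare variants, which may differ between profile types.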
CONCLUSIONS: In addition to recommendations on machine learning practices for L. monocytogenes source attribution based on genomic data, the present study provides a freely available workflow that can be applied, without source code modifications, to other balanced or unbalanced multiclass phenotypes from binary and categorical genomic profiles of other microorganisms.