Source attribution has traditionally involved combining epidemiological data with different pathogen characterisation methods, including 7-gene multilocus sequence typing (MLST) or serotyping; however, these approaches have limited resolution. In contrast, whole genome sequencing data provide a genome-wide view that attribution algorithms can exploit. Here, we applied a random forest (RF) algorithm to predict the primary sources of human clinical isolates of Salmonella Typhimurium (S. Typhimurium) and its monophasic variants (monophasic S. Typhimurium). To this end, we used single nucleotide polymorphism diversity in the core genome MLST alleles obtained from 1,061 laboratory-confirmed human and animal S. Typhimurium and monophasic S. Typhimurium isolates as inputs into an RF model. The algorithm was used for supervised learning to classify 399 animal S. Typhimurium and monophasic S. Typhimurium isolates into one of eight distinct primary source classes comprising common livestock and pet animal species: cattle, pigs, sheep, other mammals (pets: mostly dogs and horses), broilers, layers, turkeys, and game birds (pheasants, quail, and pigeons). When applied to the training set animal isolates, model accuracy was 0.929 and kappa 0.905, whereas for the test set animal isolates, for which the primary source class information was withheld from the model, the accuracy was 0.779 and kappa 0.700. Subsequently, the model was applied to assign 662 human clinical cases to the eight primary source classes. In the dataset, 60/399 (15.0%) of the animal and 141/662 (21.3%) of the human isolates were associated with a known outbreak of S. Typhimurium definitive type (DT) 104. All but two of the 141 DT104 outbreak-linked human isolates were correctly attributed by the model to the primary source classes identified as the origin of the DT104 outbreak.
A model that was run without the clonal DT104 animal isolates produced largely congruent outputs (training set accuracy 0.989 and kappa 0.985; test set accuracy 0.781 and kappa 0.663). Overall, our results show that RF offers considerable promise as a suitable methodology for epidemiological tracking and source attribution for foodborne pathogens.
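
The accuracy and kappa values reported above are both measures of agreement between predicted and known source classes; kappa (assumed here to be Cohen's kappa, which the abstract does not state explicitly) additionally corrects for the agreement expected by chance given the class frequencies. The following is a minimal sketch of how the two metrics relate, using only the Python standard library. The function name `accuracy_and_kappa` and the toy labels are hypothetical illustrations, not the study's data or code.

```python
from collections import Counter
import math


def accuracy_and_kappa(true_labels, predicted_labels):
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    if len(true_labels) != len(predicted_labels) or len(true_labels) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(true_labels)

    # Observed agreement: fraction of isolates whose predicted class matches.
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n

    # Chance agreement: for each class, the product of its marginal
    # frequencies among the true and the predicted labels, summed over classes.
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    if math.isclose(expected, 1.0):
        # Only one class present on both sides: kappa is undefined.
        return observed, float("nan")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical toy example with three of the eight source classes.
true_sources = ["cattle", "cattle", "pigs", "pigs", "broilers", "broilers"]
predicted_sources = ["cattle", "pigs", "pigs", "pigs", "broilers", "broilers"]

acc, kappa = accuracy_and_kappa(true_sources, predicted_sources)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")  # accuracy=0.833 kappa=0.750
```

Because kappa subtracts chance agreement, it is always at or below accuracy for a classifier that performs better than chance, which matches the pattern in the reported figures (for example, test set accuracy 0.779 against kappa 0.700).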