Keywords: Escherichia coli; logic regression; microbial source tracking; source attribution; supervised learning

Source: DOI:10.1128/aem.00227-24

Abstract:
Microbial source tracking leverages a wide range of approaches designed to trace the origins of fecal contamination in aquatic environments. Although source tracking methods are typically employed within the laboratory setting, computational techniques can be leveraged to advance microbial source tracking methodology. Herein, we present a logic regression-based supervised learning approach for the discovery of source-informative genetic markers within intergenic regions across the Escherichia coli genome that can be used for source tracking. Using only single intergenic loci, logic regression was able to identify highly source-specific biomarkers (i.e., specificity exceeding 97.00%) for a wide range of host and niche sources, with sensitivities reaching as high as 30.00%-50.00% for certain source categories, including pig, sheep, mouse, and wastewater, depending on the specific intergenic locus analyzed. Restricting the source range to reflect the most prominent zoonotic sources of E. coli transmission (i.e., bovine, chicken, human, and pig) allowed for the generation of informative biomarkers for all host categories, with specificities of at least 90.00% and sensitivities between 12.50% and 70.00%, using the sequence data from key intergenic regions, including emrKY-evgAS, ibsB-(mdtABCD-baeSR), ompC-rcsDB, and yedS-yedR, that appear to be involved in antibiotic resistance. Remarkably, we were able to use this approach to classify 48 out of 113 river water E. coli isolates collected in Northwestern Sweden as either beaver, human, or reindeer in origin with a high degree of consensus, highlighting the potential of logic regression modeling as a novel approach for augmenting current source tracking efforts.
Importance:
The presence of microbial contaminants, particularly from fecal sources, within water poses a serious risk to public health. The health and economic burden of waterborne pathogens can be substantial; as such, the ability to detect and identify the sources of fecal contamination in environmental waters is crucial for the control of waterborne diseases. This can be accomplished through microbial source tracking, which involves the use of various laboratory techniques to trace the origins of microbial pollution in the environment. Building on current source tracking methodology, we describe a novel workflow that uses logic regression, a supervised machine learning method, to discover genetic markers in Escherichia coli, a common fecal indicator bacterium, that can be used for source tracking efforts. Importantly, our research provides an example of how increasingly prominent machine learning algorithms can be applied to improve upon current microbial source tracking methodology.
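The sensitivity and specificity figures quoted in the abstract can be illustrated with a small sketch. Logic regression produces Boolean combinations of binary predictors, so the example below scores one such rule against isolates of known origin. This is only an illustration under stated assumptions, not the authors' workflow: the feature names (`snp_a`, `snp_b`, `snp_c`), the rule `pig_rule`, and the eight toy isolates are all hypothetical and carry no biological meaning.

```python
from typing import Callable, Dict, List, Tuple

# One isolate = (binary features at an assumed intergenic locus, known source).
Isolate = Tuple[Dict[str, int], str]


def sensitivity_specificity(
    isolates: List[Isolate],
    rule: Callable[[Dict[str, int]], bool],
    target: str,
) -> Tuple[float, float]:
    """Score a Boolean marker rule for one target source category.

    Sensitivity = fraction of target-source isolates the rule flags.
    Specificity = fraction of non-target isolates the rule leaves unflagged.
    """
    tp = fn = tn = fp = 0
    for features, source in isolates:
        flagged = rule(features)
        if source == target:
            if flagged:
                tp += 1
            else:
                fn += 1
        else:
            if flagged:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def pig_rule(features: Dict[str, int]) -> bool:
    """Hypothetical Boolean marker: (snp_a AND NOT snp_b) OR snp_c."""
    return bool(
        (features["snp_a"] and not features["snp_b"]) or features["snp_c"]
    )


# Hypothetical toy isolates, invented purely to exercise the function.
TOY_ISOLATES: List[Isolate] = [
    ({"snp_a": 1, "snp_b": 0, "snp_c": 0}, "pig"),
    ({"snp_a": 0, "snp_b": 0, "snp_c": 1}, "pig"),
    ({"snp_a": 0, "snp_b": 1, "snp_c": 0}, "pig"),
    ({"snp_a": 1, "snp_b": 1, "snp_c": 0}, "pig"),
    ({"snp_a": 0, "snp_b": 0, "snp_c": 0}, "human"),
    ({"snp_a": 1, "snp_b": 1, "snp_c": 0}, "human"),
    ({"snp_a": 0, "snp_b": 1, "snp_c": 0}, "bovine"),
    ({"snp_a": 1, "snp_b": 0, "snp_c": 0}, "chicken"),
]

sens, spec = sensitivity_specificity(TOY_ISOLATES, pig_rule, "pig")
print(f"sensitivity={sens:.2%} specificity={spec:.2%}")
```

A marker with high specificity but modest sensitivity, as reported in the abstract, rarely flags isolates from the wrong source but misses many isolates from the right one. That trade-off is why such markers are useful for attributing the subset of isolates that carry them.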