Keywords: EBPR; machine learning; sample size assessment; single-cell Raman microspectroscopy; single-cell technology

MeSH: Biological Products; Humans; Machine Learning; Phosphorus / chemistry; Polyphosphates; Sewage; Spectrum Analysis, Raman

Source: DOI:10.1021/acs.est.1c08768

Abstract:
Rapid progress in advanced analytical methods, such as single-cell technologies, enables an unprecedented and deeper understanding of microbial ecology beyond the resolution of conventional approaches. A major application challenge lies in determining a sufficient sample size without adequate prior knowledge of community complexity, while balancing statistical power against limited time or resources. This hinders the desired standardization and wider application of these technologies. Here, we proposed, tested, and validated a computational sample size assessment protocol that takes advantage of a metric named kernel divergence. This metric has two advantages. First, it directly compares data set-wise distributional differences and requires no human intervention or prior knowledge-based preclassification. Second, it makes minimal assumptions about the distribution and sample space in data processing, which broadens its application domain. This enables test-verified, appropriate handling of data sets with both linear and nonlinear relationships. The model was then validated in a case study with single-cell Raman spectroscopy (SCRS) phenotyping data sets from eight different enhanced biological phosphorus removal (EBPR) activated sludge communities located across North America. The model allows the determination of a sufficient sample size for any targeted or customized information capture capacity or resolution level. Owing to its flexibility and minimal restrictions on input data types, the proposed method is expected to serve as a standardized approach for sample size optimization, enabling more comparable and reproducible experiments and analyses on complex environmental samples. Finally, these advantages allow the method to be extended to other single-cell technologies or environmental applications whose data sets exhibit continuous features.
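
The abstract does not give the formula for kernel divergence, so the following is only a rough sketch of the general idea, not the authors' protocol. As a stand-in it uses the squared maximum mean discrepancy (MMD) with a Gaussian kernel, a common kernel-based measure of the difference between two distributions. The function names, the bandwidth, the tolerance, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples.

    Assumption: MMD stands in for the paper's kernel divergence,
    which the abstract does not define.
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def divergence_curve(data, sizes, repeats=5, bandwidth=1.0, seed=0):
    """Average divergence between random subsamples and the full data set."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        values = [
            mmd_squared(rng.sample(data, n), data, bandwidth)
            for _ in range(repeats)
        ]
        curve[n] = sum(values) / repeats
    return curve


def smallest_sufficient_size(curve, tolerance):
    """Smallest sample size whose average divergence is within tolerance."""
    for n in sorted(curve):
        if curve[n] <= tolerance:
            return n
    return None


def make_demo_data(n_points, seed=0):
    """Hypothetical stand-in for spectra: 2-D points from two clusters."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (3.0, 3.0)]
    data = []
    for i in range(n_points):
        cx, cy = centers[i % 2]
        data.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
    return data


if __name__ == "__main__":
    demo = make_demo_data(60)
    curve = divergence_curve(demo, sizes=[5, 10, 20, 40, 60])
    for n in sorted(curve):
        print(n, round(curve[n], 4))
    # The tolerance value here is arbitrary and chosen only for the demo.
    print("sufficient size:", smallest_sufficient_size(curve, tolerance=0.02))
```

In this sketch the divergence shrinks as the subsample grows, and the tolerance plays the role of the targeted resolution level mentioned in the abstract: a stricter tolerance demands a larger sample. With real single-cell Raman data, each point would be a preprocessed spectrum rather than a 2-D coordinate, and the kernel and bandwidth would need to be chosen accordingly.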