BioM2: biologically informed multi-stage machine learning for phenotype prediction using omics data

Abstract:

Navigating the complex landscape of high-dimensional omics data with machine learning models presents a significant challenge. Integrating biological domain knowledge into these models has shown promise in creating more meaningful stratifications of predictor variables, leading to algorithms that are both more accurate and more generalizable. However, machine learning tools capable of incorporating such biological knowledge are not yet widely available. To address this gap, we introduce BioM2, a novel R package designed for biologically informed multistage machine learning. BioM2 leverages biological information to stratify and aggregate high-dimensional biological data in the context of machine learning. Applied to genome-wide DNA methylation and transcriptome-wide gene expression data, BioM2 has been shown to enhance predictive performance, surpassing traditional machine learning models that operate without the integration of biological knowledge. A key feature of BioM2 is its ability to rank predictor variables within biological categories, specifically Gene Ontology pathways. This functionality not only aids the interpretability of the results but also enables a subsequent modular network analysis of these variables, shedding light on the systems-level biology underpinning the predictive outcome. In summary, we propose a biologically informed multistage machine learning framework, termed BioM2, for phenotype prediction based on omics data. BioM2 is available as an R package on CRAN (https://cran.r-project.org/web/packages/BioM2/index.html).
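
The stratify-then-aggregate idea in the abstract can be illustrated with a toy two-stage pipeline. The following is a minimal sketch in Python under stated assumptions, not the BioM2 R package's API or its actual learners: the feature names, the pathway map, and the nearest-centroid scorer are hypothetical stand-ins chosen for brevity. Stage 1 fits one small model per pathway and reduces each sample to one score per pathway; stage 2 combines those pathway-level scores into a final prediction (here, simply by averaging).

```python
from statistics import mean

# Hypothetical toy data: each sample maps feature names (e.g. CpGs or genes) to values.
samples = [
    {"f1": 0.9, "f2": 0.8, "f3": 0.1, "f4": 0.2},
    {"f1": 0.8, "f2": 0.9, "f3": 0.2, "f4": 0.1},
    {"f1": 0.1, "f2": 0.2, "f3": 0.9, "f4": 0.8},
    {"f1": 0.2, "f2": 0.1, "f3": 0.8, "f4": 0.9},
]
labels = [1, 1, 0, 0]

# Hypothetical pathway annotation (a stand-in for a Gene Ontology mapping).
pathways = {"pathway_A": ["f1", "f2"], "pathway_B": ["f3", "f4"]}


def fit_centroids(rows, y, features):
    """Stage-1 learner: per-class mean vector over one pathway's features."""
    centroids = {}
    for cls in set(y):
        members = [r for r, label in zip(rows, y) if label == cls]
        centroids[cls] = [mean([r[f] for r in members]) for f in features]
    return centroids


def pathway_score(row, centroids, features):
    """Positive when the sample lies closer to the class-1 centroid than to class 0."""
    def sq_dist(centroid):
        return sum((row[f] - v) ** 2 for f, v in zip(features, centroid))
    return sq_dist(centroids[0]) - sq_dist(centroids[1])


def fit_two_stage(rows, y, pathway_map):
    """Stage 1: fit one model per pathway (stratification by biological category)."""
    return {name: fit_centroids(rows, y, feats) for name, feats in pathway_map.items()}


def pathway_features(row, models, pathway_map):
    """Aggregate a sample's raw features into one score per pathway."""
    return {
        name: pathway_score(row, models[name], feats)
        for name, feats in pathway_map.items()
    }


def predict(row, models, pathway_map):
    """Stage 2 (simplified): average the pathway-level scores and threshold at zero."""
    scores = pathway_features(row, models, pathway_map)
    return 1 if mean(list(scores.values())) > 0 else 0


models = fit_two_stage(samples, labels, pathways)
print([predict(s, models, pathways) for s in samples])
```

In a realistic setting, the stage-1 scores used to train the second stage would normally be computed out-of-fold (for example by cross-validation) so that the stage-2 learner does not see scores fitted on its own training samples; the sketch omits this for brevity.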
