Keywords: XGBoost; binary classification; deep learning; Kyoto Encyclopedia of Genes and Genomes (KEGG); machine learning; metabolic pathway; metabolism; metabolite; multilayer perceptron; supervised learning

Source: DOI:10.3390/metabo14050266

Abstract:
A major limitation of most metabolomics datasets is the sparsity of pathway annotations for detected metabolites. It is common for fewer than half of the identified metabolites in these datasets to have a known metabolic pathway involvement. To address this limitation, machine learning models have been developed to predict the association of a metabolite with a "pathway category", as defined by a metabolic knowledge base like KEGG. Past models were implemented as a single binary classifier specific to a single pathway category, so generating predictions for multiple pathway categories required a set of binary classifiers. This approach multiplied the computational resources necessary for training while diluting the positive entries in the gold standard datasets needed for training. To address these limitations, we propose a generalization of the metabolic pathway prediction problem: a single binary classifier that accepts features representing both a metabolite and a pathway category and then predicts whether the given metabolite is involved in the corresponding pathway category. We demonstrate that this metabolite-pathway feature-pair approach not only outperforms the combined performance of separately trained binary classifiers but also shows an order of magnitude improvement in robustness: a Matthews correlation coefficient of 0.784 ± 0.013 versus 0.768 ± 0.154.
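The pair formulation described in the abstract can be illustrated with a minimal sketch. It assumes each metabolite and each pathway category already has a fixed-length numeric feature vector; the identifiers (`M1`, `P_lipid`, and so on), the vectors, and the helper names `build_pairs` and `mcc` are hypothetical and are not taken from the paper. The sketch only shows how one training table for a single binary classifier could be assembled and how a Matthews correlation coefficient is computed. It does not reproduce the authors' feature construction or their models.

```python
from itertools import product
from math import sqrt


def build_pairs(metabolite_features, pathway_features, known_links):
    """Cross every metabolite with every pathway category.

    Each row of X is the metabolite vector followed by the pathway vector.
    The label is 1 when the (metabolite, pathway) pair is a known association.
    """
    pair_ids, X, y = [], [], []
    for (m_id, m_vec), (p_id, p_vec) in product(
        metabolite_features.items(), pathway_features.items()
    ):
        pair_ids.append((m_id, p_id))
        X.append(list(m_vec) + list(p_vec))
        y.append(1 if (m_id, p_id) in known_links else 0)
    return pair_ids, X, y


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical toy data: two metabolites, two pathway categories.
metabolites = {"M1": [1.0, 0.0], "M2": [0.0, 1.0]}
pathways = {"P_lipid": [1, 0, 0], "P_amino": [0, 1, 0]}
links = {("M1", "P_lipid"), ("M2", "P_amino")}

pair_ids, X, y = build_pairs(metabolites, pathways, links)
```

Any binary classifier that accepts a numeric matrix could then be fit on `X` and `y`. Because every pathway category contributes rows to the same table, the positive examples are pooled rather than split across one training set per category.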