Keywords: benchmark; data integration; multi-omics data; prediction models; supervised analysis

MeSH: Humans; Computational Biology / methods; Algorithms; Genomics / methods, statistics & numerical data; Multiomics

Source: DOI:10.1093/bib/bbae331

Abstract:
Recent advances in sequencing, mass spectrometry, and cytometry technologies have enabled researchers to collect multiple 'omics data types from a single sample. These large datasets have led to a growing consensus that a holistic approach is needed to identify new candidate biomarkers and unveil mechanisms underlying disease etiology, a key to precision medicine. While many reviews and benchmarks have been conducted on unsupervised approaches, their supervised counterparts have received less attention in the literature, and no gold standard has emerged yet. In this work, we present a thorough comparison of a selection of six methods, representative of the main families of intermediate integrative approaches (matrix factorization, multiple kernel methods, ensemble learning, and graph-based methods). As a non-integrative control, random forest was applied to concatenated and separate data types. Methods were evaluated for classification performance on both simulated and real-world datasets, the latter being carefully selected to cover different medical applications (infectious diseases, oncology, and vaccines) and data modalities. A total of 15 simulation scenarios were designed from the real-world datasets to explore a large and realistic parameter space (e.g. sample size, dimensionality, class imbalance, effect size). On real data, the method comparison showed that integrative approaches performed as well as or better than their non-integrative counterpart. By contrast, DIABLO and the four random forest alternatives outperformed the others across the majority of simulation scenarios. The strengths and limitations of these methods are discussed in detail, along with guidelines for future applications.