Keywords: disease pathway network; functional enrichment; gene expression data; gene set enrichment analysis; pathway enrichment analysis; systems biology

MeSH: Benchmarking; RNA-Seq

Source: DOI:10.1093/bib/bbae069

Abstract:
Enrichment analysis (EA) is a common approach to gain functional insights from genome-scale experiments. As a consequence, a large number of EA methods have been developed, yet it is unclear from previous studies which method is the best for a given dataset. The main issues with previous benchmarks include the complexity of correctly assigning true pathways to a test dataset, and lack of generality of the evaluation metrics, for which the rank of a single target pathway is commonly used. We here provide a generalized EA benchmark and apply it to the most widely used EA methods, representing all four categories of current approaches. The benchmark employs a new set of 82 curated gene expression datasets from DNA microarray and RNA-Seq experiments for 26 diseases, of which only 13 are cancers. In order to address the shortcomings of the single target pathway approach and to enhance the sensitivity evaluation, we present the Disease Pathway Network, in which related Kyoto Encyclopedia of Genes and Genomes pathways are linked. We introduce a novel approach to evaluate pathway EA by combining sensitivity and specificity to provide a balanced evaluation of EA methods. This approach identifies Network Enrichment Analysis methods as the overall top performers compared with overlap-based methods. By using randomized gene expression datasets, we explore the null hypothesis bias of each method, revealing that most of them produce skewed P-values.
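The null-hypothesis check mentioned at the end of the abstract can be illustrated with a minimal sketch. Under a true null, a well-calibrated method should return P-values that are approximately uniform on [0, 1], so one simple way to quantify skew is the Kolmogorov-Smirnov distance from the uniform distribution together with the fraction of P-values below a significance threshold. This is not the authors' procedure: the function names, the threshold of 0.05, and the synthetic P-values are all assumptions made for illustration.

```python
import random


def ks_uniform_statistic(p_values):
    """Kolmogorov-Smirnov distance between the empirical CDF and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The uniform CDF at x is x itself; compare it with the empirical
        # CDF just after (i + 1) / n and just before i / n the point x.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


def false_positive_rate(p_values, alpha=0.05):
    """Fraction of null P-values that would be called significant at alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Synthetic stand-ins for P-values obtained on randomized datasets
# (hypothetical data, not results from the paper).
rng = random.Random(0)
calibrated = [rng.random() for _ in range(2000)]   # uniform: no bias
skewed = [rng.random() ** 3 for _ in range(2000)]  # piled up near zero

for name, pvals in [("calibrated", calibrated), ("skewed", skewed)]:
    print(name,
          "KS distance:", round(ks_uniform_statistic(pvals), 3),
          "FPR at 0.05:", round(false_positive_rate(pvals), 3))
```

A KS distance near zero and a false positive rate near the chosen threshold indicate calibrated P-values, while large values of either suggest a method that is biased under the null.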