Keywords: RNA-seq; benchmark data set; differential analysis

MeSH: Arabidopsis / genetics; Arabidopsis Proteins / genetics; Computer Simulation; Datasets as Topic; Gene Expression Profiling / methods; High-Throughput Nucleotide Sequencing / methods; Humans; Models, Statistical; RNA / genetics; Sequence Analysis, RNA / methods; Software; Transcriptome

Source: DOI:10.1093/bib/bbw092

Abstract:
Numerous statistical pipelines are now available for the differential analysis of gene expression measured with RNA-sequencing technology. Most of them are based on similar statistical frameworks after normalization, differing primarily in the choice of data distribution, mean and variance estimation strategy and data filtering. We propose an evaluation of the impact of these choices when few biological replicates are available through the use of synthetic data sets. This framework is based on real data sets and allows the exploration of various scenarios differing in the proportion of non-differentially expressed genes. Hence, it provides an evaluation of the key ingredients of the differential analysis, free of the biases associated with the simulation of data using parametric models. Our results show the relevance of a proper modeling of the mean by using linear or generalized linear modeling. Once the mean is properly modeled, the impact of the other parameters on the performance of the test is much less important. Finally, we propose to use the simple visualization of the raw P-value histogram as a practical evaluation criterion of the performance of differential analysis methods on real data sets.
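The raw P-value histogram mentioned at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes the usual reading of such a histogram: for a well-calibrated test, P-values of non-differentially expressed genes are roughly uniform on [0, 1], so the histogram should look flat apart from a possible peak near 0 contributed by differentially expressed genes. The helper names `pvalue_histogram` and `print_histogram`, the bin count, and the simulated mixture (9000 uniform P-values plus 1000 drawn from a Beta(0.3, 4) distribution) are all hypothetical choices made for illustration only.

```python
import random


def pvalue_histogram(pvalues, n_bins=20):
    """Count P-values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"P-value out of range: {p}")
        # A P-value of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


def print_histogram(counts, width=50):
    """Print a plain-text bar chart, one line per bin."""
    n_bins = len(counts)
    peak = max(counts) or 1  # avoid division by zero for empty input
    for i, c in enumerate(counts):
        lo, hi = i / n_bins, (i + 1) / n_bins
        bar = "#" * round(width * c / peak)
        print(f"[{lo:.2f}, {hi:.2f}) {c:6d} {bar}")


# Hypothetical data (an assumption, not taken from the paper):
# 9000 "null" genes with uniform P-values and 1000 "differentially
# expressed" genes whose P-values concentrate near 0.
rng = random.Random(0)
null_p = [rng.random() for _ in range(9000)]
alt_p = [rng.betavariate(0.3, 4.0) for _ in range(1000)]

counts = pvalue_histogram(null_p + alt_p)
print_histogram(counts)
```

On real data, the list of raw P-values returned by a differential analysis method would replace the simulated values. A histogram that departs strongly from the flat-plus-peak shape (for example, one that rises toward 1) is commonly taken as a sign that the model does not fit the data well.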