Multiple hypothesis testing

  • Article type: Journal Article
    Linear mixed models (LMMs) are a commonly used method for genome-wide association studies (GWAS) that aim to detect associations between genetic markers and phenotypic measurements in a population of individuals while accounting for population structure and cryptic relatedness. In a standard GWAS, hundreds of thousands to millions of statistical tests are performed, requiring control for multiple hypothesis testing. Typically, static corrections that penalize the number of tests performed are used to control for the family-wise error rate, which is the probability of making at least one false positive. However, it has been shown that in practice this threshold is too conservative for normally distributed phenotypes and not stringent enough for non-normally distributed phenotypes. Therefore, permutation-based LMM approaches have recently been proposed to provide a more realistic threshold that takes phenotypic distributions into account. In this work, we will discuss the advantages of permutation-based GWAS approaches, including new simulations and results from a re-analysis of all publicly available Arabidopsis thaliana phenotypes from the AraPheno database.
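    A minimal sketch of the permutation idea described above, under simplifying assumptions: plain per-SNP linear regression stands in for the LMM (so population structure and kinship are not modeled), the data are simulated, and all names and sizes are illustrative. Permuting the phenotype breaks genotype-phenotype associations while preserving the genotype correlation structure, and the alpha-quantile of the permuted minimum p-values gives a phenotype-aware family-wise threshold that can be compared with the static Bonferroni correction.

```python
# Minimal sketch of a permutation-based GWAS threshold (illustrative sizes; a
# plain per-SNP regression stands in for the LMM, so kinship/population
# structure is NOT modeled here).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_individuals, n_snps, n_perm, alpha = 500, 1000, 100, 0.05

genotypes = rng.binomial(2, 0.3, size=(n_individuals, n_snps))  # 0/1/2 minor-allele counts
phenotype = rng.normal(size=n_individuals)                      # a purely null phenotype

def min_pvalue(y, X):
    """Smallest per-SNP p-value from simple linear regression of y on each SNP."""
    return min(stats.linregress(X[:, j], y).pvalue for j in range(X.shape[1]))

# Null distribution of the minimum p-value: permuting the phenotype breaks
# genotype-phenotype links but keeps the genotype (LD) structure intact.
null_min_p = np.array([min_pvalue(rng.permutation(phenotype), genotypes)
                       for _ in range(n_perm)])

perm_threshold = np.quantile(null_min_p, alpha)  # phenotype-aware FWER threshold
bonferroni = alpha / n_snps                      # static correction, for comparison
print(f"permutation threshold: {perm_threshold:.2e}   Bonferroni: {bonferroni:.2e}")
```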

  • Article type: Journal Article
    Multiple hypothesis testing is an integral component of data analysis for large-scale technologies such as proteomics, transcriptomics, or metabolomics, for which the false discovery rate (FDR) and positive FDR (pFDR) have been accepted as error estimation and control measures. The pFDR is the expectation of the false discovery proportion (FDP), the ratio of the number of rejected null hypotheses (false discoveries) to the number of all rejected hypotheses. In practice, the expectation of this ratio is approximated by the ratio of expectations; however, the conditions for transforming the former into the latter have not been investigated. This work derives exact integral expressions for the expectation (pFDR) and variance of the FDP. The widely used approximation (the ratio of expectations) is shown to be a particular case (in the limit of a large sample size) of the integral formula for the pFDR. A recurrence formula is provided to compute the pFDR for a predefined number of null hypotheses. The variance of the FDP is approximated for a practical application in peptide identification using forward and reversed protein sequences. Simulations demonstrate that the integral expression exhibits better accuracy than the approximate formula when the number of hypotheses is small. For large sample sizes, the pFDRs obtained by the integral expression and the approximation do not differ substantially. Applications to proteomics data sets are included.
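    The distinction between the expectation of the ratio, E[V/R], and the ratio of expectations, E[V]/E[R], can be seen in a small simulation. The sketch below is illustrative only (fixed normal alternatives, a small number of hypotheses where the discrepancy is largest) and does not implement the paper's integral or recurrence formulas.

```python
# Contrast E[V/R | R>0] (a pFDR-type quantity) with the common approximation
# E[V]/E[R]; all settings are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, pi0, mu, alpha, n_rep = 50, 0.8, 2.5, 0.05, 20000  # small m: approximation is worst here

fdp, V_tot, R_tot = [], 0, 0
for _ in range(n_rep):
    is_null = rng.random(m) < pi0
    z = rng.normal(loc=np.where(is_null, 0.0, mu))
    p = 1 - stats.norm.cdf(z)                  # one-sided p-values
    reject = p < alpha
    V, R = np.sum(reject & is_null), np.sum(reject)
    V_tot += V
    R_tot += R
    if R > 0:                                   # condition on at least one rejection
        fdp.append(V / R)

print("E[V/R | R>0] (pFDR-style):", np.mean(fdp))
print("E[V]/E[R] approximation:  ", V_tot / R_tot)
```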

  • Article type: Journal Article
    A growing number of modern scientific problems in areas such as genomics, neurobiology, and spatial epidemiology involve the measurement and analysis of thousands of related features that may be stochastically dependent at arbitrarily strong levels. In this work, we consider the scenario where the features follow a multivariate Normal distribution. We demonstrate that dependence is manifested as random variation shared among features, and that standard methods may yield highly unstable inference due to dependence, even when the dependence is fully parameterized and utilized in the procedure. We propose a "cross-dimensional inference" framework that alleviates the problems due to dependence by modeling and removing the variation shared among features, while also properly regularizing estimation across features. We demonstrate the framework on both simultaneous point estimation and multiple hypothesis testing in scenarios derived from the scientific applications of interest.
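    The sketch below is a generic illustration of the idea of removing variation shared among features, not the authors' cross-dimensional inference framework: a single latent factor is estimated from the leading singular vector and projected out before per-feature tests. All names and settings are hypothetical.

```python
# Generic illustration: features share a latent factor that induces dependence;
# estimating the factor and projecting it out stabilizes per-feature inference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, m = 100, 500                                  # samples x features
shared = rng.normal(size=n)                      # latent variation shared by features
loadings = rng.normal(scale=0.8, size=m)
Y = np.outer(shared, loadings) + rng.normal(size=(n, m))   # null data + shared variation

# Naive per-feature one-sample t-tests ignore the dependence.
p_naive = stats.ttest_1samp(Y, 0.0, axis=0).pvalue

# Estimate the shared component with the leading left singular vector and remove it.
u, s, vt = np.linalg.svd(Y - Y.mean(axis=0), full_matrices=False)
factor = u[:, 0]                                 # unit-norm estimate of the shared factor
Y_adj = Y - np.outer(factor, factor @ Y)         # project the estimated factor out
p_adj = stats.ttest_1samp(Y_adj, 0.0, axis=0).pvalue

print("naive rejections at 0.05:   ", np.sum(p_naive < 0.05))
print("adjusted rejections at 0.05:", np.sum(p_adj < 0.05))
```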

  • Article type: Journal Article
    Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop for senior high-school or junior undergraduate students, without prior knowledge in statistical genetics but with some basic knowledge in data science, to conduct their own genome-wide association study (GWAS). The GWAS was performed for open source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain ∼1.4 million p-values from a real scientific study, within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further their learning motivation by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present in their friends or families. The appended open source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on Github and Zenodo to further enhance reproducible research and training.
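    As a companion to the multiple-testing step mentioned above, here is a short, self-contained sketch applying Bonferroni and Benjamini-Hochberg corrections to a p-value vector of roughly the scale cited in the abstract; the p-values are simulated and the planted signals are purely illustrative.

```python
# Bonferroni and Benjamini-Hochberg on ~1.4 million simulated p-values.
import numpy as np

rng = np.random.default_rng(3)
m = 1_400_000
pvals = rng.uniform(size=m)                      # mostly null tests
pvals[:50] = rng.uniform(0, 1e-9, size=50)       # a few planted strong associations

alpha = 0.05
bonferroni_hits = np.sum(pvals < alpha / m)      # family-wise error control

# Benjamini-Hochberg: reject the k smallest p-values, where k is the largest
# index with p_(k) <= (k/m) * alpha; this controls the FDR.
ranked = np.sort(pvals)
below = ranked <= alpha * np.arange(1, m + 1) / m
bh_hits = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0

print("Bonferroni discoveries:", bonferroni_hits)
print("BH discoveries:        ", bh_hits)
```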

  • Article type: Journal Article
    Multiple hypothesis testing has been widely applied to problems dealing with high-dimensional data, for example, the selection of important variables or features from a large number of candidates while controlling the error rate. The most prevailing measure of error rate used in multiple hypothesis testing is the false discovery rate (FDR). In recent years, the local false discovery rate (fdr) has drawn much attention, due to its advantage of assessing the confidence of individual hypotheses. However, most methods estimate the fdr through P-values or statistics with known null distributions, which are sometimes unavailable or unreliable. Adopting the innovative methodology of competition-based procedures, for example, the knockoff filter, this paper proposes a new approach, named TDfdr, for fdr estimation that is free of P-values or known null distributions. Extensive simulation studies demonstrate that TDfdr can accurately estimate the fdr with two competition-based procedures. We applied the TDfdr method to two real biomedical tasks. One is to identify significantly differentially expressed proteins related to the COVID-19 disease, and the other is to detect mutations in the genotypes of HIV-1 that are associated with drug resistance. Higher discovery power was observed compared to existing popular methods.
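    A hedged sketch of the general target-decoy (competition-based) idea that such procedures build on, not the TDfdr estimator itself: decoy scores serve as a proxy for the unknown null distribution, so the FDR at a score cutoff can be estimated as the number of decoys above the cutoff divided by the number of targets above it. Scores and parameters are simulated.

```python
# Generic target-decoy FDR estimate: decoys stand in for the null distribution.
import numpy as np

rng = np.random.default_rng(4)
n_true, n_false = 300, 700
target_scores = np.concatenate([rng.normal(3.0, 1.0, n_true),    # true matches
                                rng.normal(0.0, 1.0, n_false)])  # false matches
decoy_scores = rng.normal(0.0, 1.0, n_true + n_false)            # null proxy

def estimated_fdr(cutoff):
    """Estimated FDR among targets scoring at or above the cutoff."""
    decoys = np.sum(decoy_scores >= cutoff)
    targets = np.sum(target_scores >= cutoff)
    return decoys / max(targets, 1)

for c in (1.0, 2.0, 3.0):
    print(f"cutoff {c}: estimated FDR ~ {estimated_fdr(c):.3f}")
```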

  • Article type: Journal Article
    Colocalization analysis of genomic region sets has been widely adopted to unveil potential functional interactions between corresponding biological attributes, which often serves as the basis for further investigation. A number of methods have been developed for colocalization analysis of genomic elements. However, none of them explicitly considers transcriptome heterogeneity and isoform ambiguity, making them less appropriate for analyzing transcriptome elements. Here, we developed RgnTX, an R/Bioconductor tool for the colocalization analysis of transcriptome elements with permutation tests. Different from existing approaches, RgnTX directly takes advantage of transcriptome annotation and offers high flexibility in the null model to simulate realistic transcriptome-wide background, such as complex alternative splicing patterns. Importantly, it supports the testing of transcriptome elements without clear isoform association, which is often the real scenario due to technical limitations. The proposed package offers a wide selection of predefined functions that users can readily apply for visualizing permutation results, calculating shifted z-scores, and conducting multiple hypothesis testing under Benjamini-Hochberg correction. Moreover, with synthetic and real datasets, we show that RgnTX's novel testing modes return distinct and more significant results compared to existing genome-based methods. We believe RgnTX should be a useful tool for characterizing the randomness of the transcriptome and for conducting statistical association analysis of genomic region sets within the heterogeneous transcriptome. The package has been accepted by Bioconductor and is freely available at: https://bioconductor.org/packages/RgnTX.
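    RgnTX itself is an R/Bioconductor package operating on transcriptome annotation; the toy Python sketch below only illustrates the underlying permutation logic (observed overlap versus a permuted null, a z-score, and an empirical p-value) on a single sequence, with made-up region sets and sizes.

```python
# Permutation test for overlap between two region sets on one sequence.
import numpy as np

rng = np.random.default_rng(5)
seq_len, n_a, n_b, width, n_perm = 100_000, 200, 200, 50, 1000

def random_regions(n):
    """n random fixed-width intervals on [0, seq_len)."""
    starts = rng.integers(0, seq_len - width, size=n)
    return np.stack([starts, starts + width], axis=1)

def overlap_count(a, b):
    """Number of regions in a that overlap at least one region in b."""
    return sum(int(np.any((b[:, 0] < e) & (b[:, 1] > s))) for s, e in a)

set_a, set_b = random_regions(n_a), random_regions(n_b)
observed = overlap_count(set_a, set_b)

# Null: re-place set_a at random positions while keeping set_b fixed.
null = np.array([overlap_count(random_regions(n_a), set_b) for _ in range(n_perm)])
z = (observed - null.mean()) / null.std(ddof=1)
p = (np.sum(null >= observed) + 1) / (n_perm + 1)
print(f"observed={observed}, z={z:.2f}, permutation p={p:.4f}")
```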

  • Article type: Journal Article
    This study investigates the impact of the COVID-19 pandemic on the Chinese stock market in 2020. Using daily data from three industries, this study frames the identification of abnormal stock returns as a multiple hypothesis testing problem and proposes a grouped comparison procedure for better detection. By comparing the numbers of daily signals and the numbers of stocks with abnormal positive and negative returns, the empirical results show that the three industries performed differently under the pandemic. Compared to the non-grouped testing procedure, the signals found by the grouped procedure are more prominent, which is advantageous in situations where abnormal performance tends to cluster around the occurrence of a major event. This paper on stock return anomalies gives a new perspective on the impact of major events, such as a global disease outbreak, on the stock market.
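    A hypothetical illustration of why grouping can matter in this kind of multiple-testing problem (this is not the paper's exact procedure): when anomalies cluster within one industry, applying the Benjamini-Hochberg correction within groups can behave differently from a pooled correction. All data are simulated.

```python
# Within-group vs pooled Benjamini-Hochberg on standardized abnormal returns.
import numpy as np
from scipy import stats

def bh_reject(p, alpha=0.05):
    """Boolean mask of Benjamini-Hochberg rejections at level alpha."""
    m = len(p)
    order = np.argsort(p)
    ok = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(ok)[0]) + 1 if ok.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(6)
n = 200                                            # stocks per industry
z_scores = {"industry_A": rng.normal(size=n),      # standardized abnormal returns
            "industry_B": rng.normal(size=n),
            "industry_C": np.concatenate([rng.normal(3.0, 1.0, 40),  # clustered anomalies
                                          rng.normal(size=n - 40)])}
pvals = {g: 2 * stats.norm.sf(np.abs(z)) for g, z in z_scores.items()}

pooled = bh_reject(np.concatenate(list(pvals.values())))
print("pooled BH discoveries:", int(pooled.sum()))
for g, p in pvals.items():
    print(f"within-group BH discoveries, {g}:", int(bh_reject(p).sum()))
```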

  • Article type: Journal Article
    BACKGROUND: Genome-wide tests, including genome-wide association studies (GWAS) of germ-line genetic variants, driver tests of cancer somatic mutations, and transcriptome-wide association tests of RNAseq data, carry a high multiple testing burden. This burden can be overcome by enrolling larger cohorts or alleviated by using prior biological knowledge to favor some hypotheses over others. Here we compare these two methods in terms of their abilities to boost the power of hypothesis testing.
    RESULTS: We provide a quantitative estimate for progress in cohort sizes and present a theoretical analysis of the power of oracular hard priors: priors that select a subset of hypotheses for testing, with an oracular guarantee that all true positives are within the tested subset. This theory demonstrates that for GWAS, strong priors that limit testing to 100-1000 genes provide less power than typical annual 20-40% increases in cohort sizes. Furthermore, non-oracular priors that exclude even a small fraction of true positives from the tested set can perform worse than not using a prior at all.
    CONCLUSIONS: Our results provide a theoretical explanation for the continued dominance of simple, unbiased univariate hypothesis tests for GWAS: if a statistical question can be answered by larger cohort sizes, it should be answered by larger cohort sizes rather than by more complicated biased methods involving priors. We suggest that priors are better suited for non-statistical aspects of biology, such as pathway structure and causality, that are not yet easily captured by standard hypothesis tests.
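    A toy power calculation in the spirit of the comparison above, with illustrative numbers rather than the paper's model: a single true association tested with a Bonferroni-corrected two-sided z-test, comparing an oracular hard prior that shrinks the test set against compounded annual cohort growth.

```python
# Power of a Bonferroni-corrected z-test under a hard prior vs cohort growth.
import numpy as np
from scipy import stats

def power(n, n_tests, beta=0.11, alpha=0.05):
    """Two-sided z-test power with noncentrality beta*sqrt(n) at level alpha/n_tests."""
    z_crit = stats.norm.ppf(1 - alpha / (2 * n_tests))
    delta = beta * np.sqrt(n)
    return stats.norm.cdf(delta - z_crit) + stats.norm.cdf(-delta - z_crit)

n0 = 2000                                            # illustrative baseline cohort
print("baseline, n=2000, 1e6 tests:         ", round(power(n0, 1_000_000), 3))
print("oracular prior, n=2000, 1e3 tests:   ", round(power(n0, 1_000), 3))
print("three years of 30% growth, 1e6 tests:", round(power(int(n0 * 1.3 ** 3), 1_000_000), 3))
```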

  • Article type: Journal Article
    The analysis of large-scale datasets, especially in biomedical contexts, frequently involves a principled screening of multiple hypotheses. The celebrated two-group model jointly models the distribution of the test statistics with mixtures of two competing densities, the null and the alternative distributions. We investigate the use of weighted densities and, in particular, non-local densities as working alternative distributions, to enforce separation from the null and thus refine the screening procedure. We show how these weighted alternatives improve various operating characteristics, such as the Bayesian false discovery rate, of the resulting tests for a fixed mixture proportion with respect to a local, unweighted likelihood approach. Parametric and nonparametric model specifications are proposed, along with efficient samplers for posterior inference. By means of a simulation study, we exhibit how our model compares with both well-established and state-of-the-art alternatives in terms of various operating characteristics. Finally, to illustrate the versatility of our method, we conduct three differential expression analyses with publicly-available datasets from genomic studies of heterogeneous nature.
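    A minimal sketch of the two-group model with a weighted ("non-local") alternative density that vanishes at zero, so the local fdr is forced to 1 at the null value. The fixed parameters and the particular moment-type weight are assumptions for illustration, whereas the paper fits parametric and nonparametric specifications with posterior samplers.

```python
# Two-group model: local fdr under a local vs a weighted (non-local) alternative.
import numpy as np
from scipy import stats

pi0, tau = 0.85, 3.0

def f0(z):                       # null density
    return stats.norm.pdf(z)

def f1_local(z):                 # unweighted (local) alternative
    return stats.norm.pdf(z, scale=tau)

def f1_nonlocal(z):              # weight w(z) = z^2, normalized by E[z^2] = tau^2
    return (z ** 2) * stats.norm.pdf(z, scale=tau) / tau ** 2

def local_fdr(z, f1):
    f = pi0 * f0(z) + (1 - pi0) * f1(z)
    return pi0 * f0(z) / f

z_grid = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print("local fdr, local alternative:    ", np.round(local_fdr(z_grid, f1_local), 3))
print("local fdr, non-local alternative:", np.round(local_fdr(z_grid, f1_nonlocal), 3))

# Bayesian FDR of the rejection region {|z| >= c}: average local fdr over rejections.
rng = np.random.default_rng(7)
is_null = rng.random(20_000) < pi0
z = np.where(is_null, rng.normal(size=20_000), rng.normal(scale=tau, size=20_000))
rejected = np.abs(z) >= 2.5
print("Bayesian FDR at |z| >= 2.5:", round(local_fdr(z[rejected], f1_local).mean(), 3))
```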

  • Article type: Journal Article
    This work proposes a two-stage procedure for identifying outlying observations in a large-dimensional data set. In the first stage, an outlier identification measure is defined using a max-normal statistic, and a clean subset containing non-outliers is obtained. The identification of outliers can be framed as a multiple hypothesis testing problem; accordingly, in the second stage, we explore the asymptotic distribution of the proposed measure and obtain a threshold for the outlying observations. Furthermore, to improve the identification power and better control the misjudgment rate, a one-step refined algorithm is proposed. Simulation results and two real data analysis examples show that, compared with other methods, the proposed procedure has great advantages in identifying outliers in various data situations.
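    A simplified stand-in for the first-stage screen (the clean-subset construction and the one-step refinement are omitted): a max-type statistic per observation on robustly standardized data, compared against a Bonferroni-style cutoff. Data and cutoff choices are illustrative.

```python
# Max-type outlier screen on robustly standardized data (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p = 200, 100
X = rng.normal(size=(n, p))
X[:5] += rng.normal(4.0, 1.0, size=(5, p)) * (rng.random((5, p)) < 0.2)  # contaminate 5 rows

# Robust per-coordinate standardization (median / MAD) so outliers do not inflate the scale.
center = np.median(X, axis=0)
scale = stats.median_abs_deviation(X, axis=0, scale="normal")
Z = (X - center) / scale

# One max-normal statistic per observation; a Bonferroni-style cutoff over all n*p
# standardized entries keeps the chance of flagging any clean observation near alpha.
alpha = 0.05
max_stat = np.max(np.abs(Z), axis=1)
cutoff = stats.norm.ppf(1 - alpha / (2 * n * p))
print("flagged observations:", np.nonzero(max_stat > cutoff)[0])
```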