multiple testing

  • Article type: Journal Article
    Identifying the causal relationship between genotype and phenotype is essential to expanding our understanding of the gene regulatory network spanning the molecular level to perceptible traits. A pleiotropic gene can act as a central hub in the network, influencing multiple outcomes. Identifying such a gene involves testing under a composite null hypothesis where the gene is associated with, at most, one trait. Traditional methods such as meta-analyses of top-hit $P$-values and sequential testing of multiple traits have been proposed, but these methods fail to consider the background of genome-wide signals. Since Huang's composite test produces uniformly distributed $P$-values for genome-wide variants under the composite null, we propose a gene-level pleiotropy test that combines that composite test with the aggregated Cauchy association test. A polygenic trait involves multiple genes with different functions that co-regulate mechanisms. We show that polygenicity should be considered when identifying pleiotropic genes; otherwise, associations driven by polygenic traits will give rise to false positives. In this study, we constructed gene-trait functional modules using the results of the proposed pleiotropy tests. Our analysis suite was implemented as the R package PGCtest. We demonstrated the proposed method with an application study of the Taiwan Biobank database and identified functional modules comprising specific genes and their co-regulated traits.
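To make the aggregation step concrete, here is a minimal Python sketch of the generic Cauchy combination rule that underlies aggregated Cauchy association tests. It is not the PGCtest implementation: the function name `cauchy_combine` and the equal default weights are assumptions made for illustration, and the inputs are assumed to be variant-level p-values strictly between 0 and 1.

```python
import math


def cauchy_combine(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule (illustrative sketch)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = [1.0] * len(p_values)
    total = sum(weights)
    # Each p-value is mapped to a standard Cauchy variate; under the null the
    # normalized weighted sum is again approximately standard Cauchy.
    stat = sum(
        (w / total) * math.tan((0.5 - p) * math.pi)
        for p, w in zip(p_values, weights)
    )
    # Upper-tail probability of a standard Cauchy distribution at `stat`.
    return 0.5 - math.atan(stat) / math.pi


if __name__ == "__main__":
    # One strong variant-level signal dominates the gene-level p-value.
    print(cauchy_combine([0.001, 0.5, 0.9]))
```

Because the transform is heavy-tailed, a single small p-value drives the combined result, which is why this kind of rule is often used to aggregate variant-level evidence to the gene level.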

  • Article type: Journal Article
    In immunology studies, flow cytometry is a commonly used multivariate single-cell assay. One key goal in flow cytometry analysis is to detect the immune cells responsive to certain stimuli. Statistically, this problem can be translated into comparing two protein expression probability density functions (pdfs) before and after the stimulus; the goal is to pinpoint the regions where these two pdfs differ. Further screening of these differential regions can be performed to identify enriched sets of responsive cells. In this paper, we model identifying differential density regions as a multiple testing problem. First, we partition the sample space into small bins. In each bin, we form a hypothesis to test the existence of differential pdfs. Second, we develop a novel multiple testing method, called TEAM (Testing on the Aggregation tree Method), to identify those bins that harbor differential pdfs while controlling the false discovery rate (FDR) under the desired level. TEAM embeds the testing procedure into an aggregation tree to test from fine to coarse resolution. The procedure achieves the statistical goal of pinpointing density differences to the smallest possible regions. TEAM is computationally efficient, capable of analyzing large flow cytometry data sets in much less time than competing methods. We applied TEAM and competing methods to a flow cytometry data set to identify T cells responsive to the cytomegalovirus (CMV)-pp65 antigen stimulation. With additional downstream screening, TEAM successfully identified enriched sets containing monofunctional, bifunctional, and polyfunctional T cells. Competing methods either did not finish in a reasonable time frame or provided less interpretable results. Numerical simulations and theoretical justifications demonstrate that TEAM has asymptotically valid, powerful, and robust performance. Overall, TEAM is a computationally efficient and statistically powerful algorithm that can yield meaningful biological insights in flow cytometry studies.

  • Article type: Journal Article
    We consider the problem of multiple hypothesis testing when there is a logical nested structure to the hypotheses. When one hypothesis is nested inside another, the outer hypothesis must be false if the inner hypothesis is false. We model the nested structure as a directed acyclic graph, including chain and tree graphs as special cases. Each node in the graph is a hypothesis and rejecting a node requires also rejecting all of its ancestors. We propose a general framework for adjusting node-level test statistics using the known logical constraints. Within this framework, we study a smoothing procedure that combines each node with all of its descendants to form a more powerful statistic. We prove a broad class of smoothing strategies can be used with existing selection procedures to control the familywise error rate, false discovery exceedance rate, or false discovery rate, so long as the original test statistics are independent under the null. When the null statistics are not independent but are derived from positively-correlated normal observations, we prove control for all three error rates when the smoothing method is arithmetic averaging of the observations. Simulations and an application to a real biology dataset demonstrate that smoothing leads to substantial power gains.
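The two structural ideas in this abstract can be sketched in a few lines of Python: smoothing a node by arithmetic averaging over the node and all of its descendants, and keeping a rejection set closed under ancestors. This is only an illustration under stated assumptions, not the authors' procedure. The graph encoding (a dict from node to its list of children), the function names, and the toy observations are all hypothetical, and no selection procedure or error-rate guarantee is implemented here.

```python
def descendants(children, node):
    """Return every node reachable from `node`, excluding `node` itself."""
    seen = set()
    stack = list(children.get(node, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(children.get(cur, []))
    return seen


def smooth_by_averaging(children, obs):
    """Replace each node's observation by the mean over the node and its descendants."""
    smoothed = {}
    for node in obs:
        group = [node, *descendants(children, node)]
        smoothed[node] = sum(obs[g] for g in group) / len(group)
    return smoothed


def close_under_ancestors(children, rejected):
    """Add all ancestors of rejected nodes so the set respects the nesting."""
    closed = set(rejected)
    for node in children:
        # `node` is an ancestor of a rejected node if one of its descendants is rejected.
        if descendants(children, node) & closed:
            closed.add(node)
    return closed


if __name__ == "__main__":
    # Hypothetical tree: root -> {a, b}, a -> {a1}.
    children = {"root": ["a", "b"], "a": ["a1"]}
    obs = {"root": 0.0, "a": 1.0, "b": 2.0, "a1": 3.0}
    print(smooth_by_averaging(children, obs))
    print(close_under_ancestors(children, {"a1"}))
```

In the toy tree, a strong signal at the leaf `a1` raises the smoothed values of `a` and `root`, which is the intuition behind the power gain the abstract reports.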

  • Article type: Journal Article
    The false discovery rate (FDR) is a widely used metric of statistical significance for genomic data analyses that involve multiple hypothesis testing. Power and sample size considerations are important in planning studies that perform these types of genomic data analyses. Here, we propose a three-rectangle approximation of a p-value histogram to derive a formula to compute the statistical power and sample size for analyses that involve the FDR. We also introduce the R package FDRsamplesize2, which incorporates these and other power calculation formulas to compute power for a broad variety of studies not covered by other FDR power calculation software. A few illustrative examples are provided. The FDRsamplesize2 package is available on CRAN.
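For readers who want a reminder of what an FDR-based analysis does with a set of p-values, the following Python sketch implements the standard Benjamini-Hochberg step-up procedure. It is not the three-rectangle power formula from the abstract and does not use the FDRsamplesize2 package; the function name and the example p-values are assumptions chosen for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Indices ordered from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value is under its threshold.
        if p_values[i] <= rank * q / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.9], q=0.05))
```

Power in this setting is the expected fraction of truly non-null hypotheses that end up in the rejected set, which is the quantity a sample size calculation for an FDR-controlled study targets.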

  • Article type: Journal Article
    Multiple testing has been a prominent topic in statistical research. Despite extensive work in this area, controlling false discoveries remains a challenging task, especially when the test statistics exhibit dependence. Various methods have been proposed to estimate the false discovery proportion (FDP) under arbitrary dependencies among the test statistics. One key approach is to transform arbitrary dependence into weak dependence and subsequently establish the strong consistency of FDP and false discovery rate under weak dependence. As a result, FDPs converge to the same asymptotic limit within the framework of weak dependence. However, we have observed that the asymptotic variance of FDP can be significantly influenced by the dependence structure of the test statistics, even when they exhibit only weak dependence. Quantifying this variability is of great practical importance, as it serves as an indicator of the quality of FDP estimation from the data. To the best of our knowledge, there is limited research on this aspect in the literature. In this paper, we aim to fill in this gap by quantifying the variation of FDP, assuming that the test statistics exhibit weak dependence and follow normal distributions. We begin by deriving the asymptotic expansion of the FDP and subsequently investigate how the asymptotic variance of the FDP is influenced by different dependence structures. Based on the insights gained from this study, we recommend that in multiple testing procedures utilizing FDP, reporting both the mean and variance estimates of FDP can provide a more comprehensive assessment of the study's outcomes.

  • Article type: Journal Article
    Diagnostic accuracy studies assess the sensitivity and specificity of a new index test in relation to an established comparator or the reference standard. The development and selection of the index test are usually assumed to be conducted prior to the accuracy study. In practice, this assumption is often violated, for instance, if the choice of the (apparently) best biomarker, model or cutpoint is based on the same data that are used later for validation purposes. In this work, we investigate several multiple comparison procedures which provide family-wise error rate control for the emerging multiple testing problem. Due to the nature of the co-primary hypothesis problem, conventional approaches for multiplicity adjustment are too conservative for the specific problem and thus need to be adapted. In an extensive simulation study, five multiple comparison procedures are compared with regard to statistical error rates in least-favourable and realistic scenarios. This covers parametric and non-parametric methods and one Bayesian approach. All methods have been implemented in the new open-source R package cases, which allows us to reproduce all simulation results. Based on our numerical results, we conclude that the parametric approaches (maxT and Bonferroni) are easy to apply but can have inflated type I error rates for small sample sizes. The two investigated Bootstrap procedures, in particular the so-called pairs Bootstrap, allow for family-wise error rate control in finite samples and in addition have competitive statistical power.

  • Article type: Journal Article
    Survival time is the primary endpoint of many randomized controlled trials, and a treatment effect is typically quantified by the hazard ratio under the assumption of proportional hazards. Awareness is increasing that in many settings this assumption is a priori violated, for example, due to delayed onset of drug effect. In these cases, interpretation of the hazard ratio estimate is ambiguous and statistical inference for alternative parameters to quantify a treatment effect is warranted. We consider differences or ratios of milestone survival probabilities or quantiles, differences in restricted mean survival times, and an average hazard ratio to be of interest. Typically, more than one such parameter needs to be reported to assess possible treatment benefits, and in confirmatory trials, the corresponding inferential procedures need to be adjusted for multiplicity. A simple Bonferroni adjustment may be too conservative because the different parameters of interest typically show considerable correlation. Hence simultaneous inference procedures that take into account the correlation are warranted. By using the counting process representation of the mentioned parameters, we show that their estimates are asymptotically multivariate normal and we provide an estimate for their covariance matrix. We propose corresponding parametric multiple testing procedures and simultaneous confidence intervals. Also, the logrank test may be included in the framework. Finite sample type I error rate and power are studied by simulation. The methods are illustrated with an example from oncology. A software implementation is provided in the R package nph.
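The claim that Bonferroni is conservative under correlation can be illustrated with a small Python simulation. This sketch is not the nph package's method or API. It assumes two standardized estimates that are jointly normal with a hypothetical correlation of 0.8, and compares the two-sided Bonferroni critical value with a Monte Carlo estimate of the critical value for the maximum absolute statistic. All function names, the correlation, the number of draws, and the seed are illustrative choices.

```python
import math
import random
from statistics import NormalDist


def bonferroni_critical_value(m, alpha=0.05):
    """Two-sided critical value when alpha is split equally over m tests."""
    return NormalDist().inv_cdf(1 - alpha / (2 * m))


def max_abs_critical_value(rho, alpha=0.05, n_draws=20000, seed=1):
    """Monte Carlo (1 - alpha) quantile of max(|Z1|, |Z2|) with corr(Z1, Z2) = rho."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_draws):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal with correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        maxima.append(max(abs(z1), abs(z2)))
    maxima.sort()
    # Empirical quantile: the smallest value with at least (1 - alpha) of draws at or below it.
    return maxima[math.ceil((1 - alpha) * n_draws) - 1]


if __name__ == "__main__":
    print("Bonferroni:", bonferroni_critical_value(2))
    print("Simultaneous (rho = 0.8):", max_abs_critical_value(0.8))
```

A smaller critical value for the correlation-aware rule means narrower simultaneous confidence intervals and more power at the same family-wise error rate.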

  • Article type: Journal Article
    In problems such as variable selection and graph estimation, models are characterized by Boolean logical structure such as the presence or absence of a variable or an edge. Consequently, false-positive error or false-negative error can be specified as the number of variables/edges that are incorrectly included or excluded in an estimated model. However, there are several other problems such as ranking, clustering, and causal inference in which the associated model classes do not admit transparent notions of false-positive and false-negative errors due to the lack of an underlying Boolean logical structure. In this paper, we present a generic approach to endow a collection of models with partial order structure, which leads to a hierarchical organization of model classes as well as natural analogs of false-positive and false-negative errors. We describe model selection procedures that provide false-positive error control in our general setting, and we illustrate their utility with numerical experiments.

  • Article type: Journal Article
    Expression quantitative trait locus (eQTL) analysis is a useful tool to identify genetic loci that are associated with gene expression levels. Large collaborative efforts such as the Genotype-Tissue Expression (GTEx) project provide valuable resources for eQTL analysis in different tissues. Most existing methods, however, either focus on one tissue at a time, or analyze multiple tissues to identify eQTLs jointly present in multiple tissues. There is a lack of powerful methods to identify eQTLs in a target tissue while effectively borrowing strength from auxiliary tissues. In this paper, we propose a novel statistical framework to improve the eQTL detection efficacy in the tissue of interest with auxiliary information from other tissues. This framework can enhance the power of the hypothesis test for eQTL effects by incorporating shared and specific effects from multiple tissues into the test statistics. We also devise data-driven and distributed computing approaches for efficient implementation of eQTL detection when the number of tissues is large. Numerical studies in simulation demonstrate the efficacy of the proposed method, and the real data analysis of the GTEx example provides novel insights into eQTL findings in different tissues.

  • Article type: Journal Article
    Transfer learning for high-dimensional Gaussian graphical models (GGMs) is studied. The target GGM is estimated by incorporating the data from similar and related auxiliary studies, where the similarity between the target graph and each auxiliary graph is characterized by the sparsity of a divergence matrix. An estimation algorithm, Trans-CLIME, is proposed and shown to attain a faster convergence rate than the minimax rate in the single-task setting. Furthermore, we introduce a universal debiasing method that can be coupled with a range of initial graph estimators and can be analytically computed in one step. A debiased Trans-CLIME estimator is then constructed and is shown to be element-wise asymptotically normal. This fact is used to construct a multiple testing procedure for edge detection with false discovery rate control. The proposed estimation and multiple testing procedures demonstrate superior numerical performance in simulations and are applied to infer the gene networks in a target brain tissue by leveraging the gene expressions from multiple other brain tissues. A significant decrease in prediction errors and a significant increase in power for link detection are observed.