false discovery rate

  • Article type: Journal Article
    Multiple hypothesis testing is an integral component of data analysis for large-scale technologies such as proteomics, transcriptomics, or metabolomics, for which the false discovery rate (FDR) and positive FDR (pFDR) have been accepted as error estimation and control measures. The pFDR is the expectation of the false discovery proportion (FDP), the ratio of the number of falsely rejected null hypotheses to the number of all rejected hypotheses. In practice, the expectation of this ratio is approximated by the ratio of expectations; however, the conditions for transforming the former into the latter have not been investigated. This work derives exact integral expressions for the expectation (pFDR) and variance of the FDP. The widely used approximation (ratio of expectations) is shown to be a particular case (in the limit of a large sample size) of the integral formula for pFDR. A recurrence formula is provided to compute the pFDR for a predefined number of null hypotheses. The variance of the FDP was approximated for a practical application in peptide identification using forward and reversed protein sequences. The simulations demonstrate that the integral expression exhibits better accuracy than the approximate formula in the case of a small number of hypotheses. For large sample sizes, the pFDRs obtained by the integral expression and the approximation do not differ substantially. Applications to proteomics data sets are included.
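    The distinction between the expectation of a ratio and the ratio of expectations can be illustrated with a small Monte Carlo sketch. This is not the paper's integral formula; it is a simulation under assumptions chosen only for illustration: a fixed number `m0` of true nulls with uniform p-values, alternatives with Beta(`alt_shape`, 1) p-values, and a fixed rejection threshold. The function name and all parameters are hypothetical.

```python
import random


def simulate_fdp(m, m0, threshold, alt_shape=0.1, n_sims=2000, seed=0):
    """Compare E[V/R | R > 0] with E[V]/E[R] by simulation.

    m         -- total number of hypotheses (assumed fixed)
    m0        -- number of true null hypotheses (assumed fixed)
    threshold -- reject every hypothesis with p-value <= threshold
    alt_shape -- illustrative assumption: alternative p-values ~ Beta(alt_shape, 1)
    """
    rng = random.Random(seed)
    fdp_values = []      # V/R for every simulation with at least one rejection
    total_false = 0      # running sum of V (false rejections)
    total_rejected = 0   # running sum of R (all rejections)

    for _ in range(n_sims):
        # True nulls: p-values are uniform on (0, 1).
        false_rej = sum(rng.random() <= threshold for _ in range(m0))
        # Alternatives: p-values concentrated near zero.
        true_rej = sum(
            rng.betavariate(alt_shape, 1.0) <= threshold for _ in range(m - m0)
        )
        rejected = false_rej + true_rej
        total_false += false_rej
        total_rejected += rejected
        if rejected > 0:
            fdp_values.append(false_rej / rejected)

    if not fdp_values:
        return float("nan"), float("nan")

    expectation_of_ratio = sum(fdp_values) / len(fdp_values)  # estimates pFDR
    ratio_of_expectations = total_false / total_rejected      # the approximation
    return expectation_of_ratio, ratio_of_expectations


if __name__ == "__main__":
    for m, m0 in [(10, 9), (500, 450)]:
        pfdr, approx = simulate_fdp(m, m0, threshold=0.01)
        print(f"m={m}: E[V/R | R>0] ~ {pfdr:.4f}, E[V]/E[R] ~ {approx:.4f}")
```

    Running the sketch for a small and a larger `m` lets the two quantities be compared side by side under these assumptions; the size of any gap depends on the assumed p-value distributions and threshold.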

  • Article type: Journal Article
    Replicability is the cornerstone of modern scientific research. Reliable identifications of genotype-phenotype associations that are significant in multiple genome-wide association studies (GWASs) provide stronger evidence for the findings. Current replicability analysis relies on the independence assumption among single-nucleotide polymorphisms (SNPs) and ignores the linkage disequilibrium (LD) structure. We show that such a strategy may produce either overly liberal or overly conservative results in practice. We develop an efficient method, ReAD, to detect replicable SNPs associated with the phenotype from two GWASs accounting for the LD structure. The local dependence structure of SNPs across two heterogeneous studies is captured by a four-state hidden Markov model (HMM) built on two sequences of p values. By incorporating information from adjacent locations via the HMM, our approach provides more accurate SNP significance rankings. ReAD is scalable, platform independent, and more powerful than existing replicability analysis methods with effective false discovery rate control. Through analysis of datasets from two asthma GWASs and two ulcerative colitis GWASs, we show that ReAD can identify replicable genetic loci that existing methods might otherwise miss.

  • Article type: Journal Article
    We consider the problem of multiple hypothesis testing when there is a logical nested structure to the hypotheses. When one hypothesis is nested inside another, the outer hypothesis must be false if the inner hypothesis is false. We model the nested structure as a directed acyclic graph, including chain and tree graphs as special cases. Each node in the graph is a hypothesis and rejecting a node requires also rejecting all of its ancestors. We propose a general framework for adjusting node-level test statistics using the known logical constraints. Within this framework, we study a smoothing procedure that combines each node with all of its descendants to form a more powerful statistic. We prove a broad class of smoothing strategies can be used with existing selection procedures to control the familywise error rate, false discovery exceedance rate, or false discovery rate, so long as the original test statistics are independent under the null. When the null statistics are not independent but are derived from positively-correlated normal observations, we prove control for all three error rates when the smoothing method is arithmetic averaging of the observations. Simulations and an application to a real biology dataset demonstrate that smoothing leads to substantial power gains.
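    The smoothing idea, combining each node with all of its descendants, can be sketched for the arithmetic-averaging case the abstract mentions. This is a minimal illustration, not the authors' procedure: the graph encoding (a dict mapping each node to its children), the function names, and the use of one scalar observation per node are all assumptions, and the subsequent selection step that controls an error rate is not shown.

```python
def descendants(children, node):
    """Return the set of all descendants of `node` in a DAG.

    `children` maps a node to an iterable of its direct children
    (hypothetical encoding chosen for this sketch).
    """
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def smooth_by_averaging(children, observations):
    """Average each node's observation with those of all its descendants.

    A descendant reachable by several paths is counted once.
    """
    smoothed = {}
    for node in observations:
        group = {node} | descendants(children, node)
        smoothed[node] = sum(observations[n] for n in group) / len(group)
    return smoothed


if __name__ == "__main__":
    # Diamond-shaped DAG: root -> a, b; a -> c; b -> c.
    dag = {"root": ["a", "b"], "a": ["c"], "b": ["c"]}
    obs = {"root": 1.0, "a": 2.0, "b": 3.0, "c": 6.0}
    print(smooth_by_averaging(dag, obs))
```

    In this encoding a strong signal at a leaf raises the smoothed value of every ancestor, which matches the logical constraint that rejecting a node requires rejecting its ancestors.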

  • Article type: Journal Article
    This paper develops a method based on model-X knockoffs to find conditional associations that are consistent across environments, controlling the false discovery rate. The motivation for this problem is that large data sets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associations replicated under different conditions may be more interesting. In fact, consistency sometimes provably leads to valid causal inferences even if conditional associations do not. While the proposed method is widely applicable, this paper highlights its relevance to genome-wide association studies, in which robustness across populations with diverse ancestries mitigates confounding due to unmeasured variants. The effectiveness of this approach is demonstrated by simulations and applications to the UK Biobank data.

  • Article type: Journal Article
    BACKGROUND: Expression quantitative trait locus (eQTL) analysis aims to detect the genetic variants that influence the expression of one or more genes. Gene-level eQTL testing forms a natural grouped-hypothesis testing strategy with clear biological importance. Methods to control the family-wise error rate or false discovery rate for group testing have been proposed earlier, but may not be powerful or easily applied to eQTL data, for which certain structured alternatives may be defensible and may enable the researcher to avoid overly conservative approaches.
    RESULTS: In an empirical Bayesian setting, we propose a new method to control the false discovery rate (FDR) for grouped hypotheses. Here, each gene forms a group, with SNPs annotated to the gene corresponding to individual hypotheses. The heterogeneity of effect sizes in different groups is considered by the introduction of a random effects component. Our method, entitled Random Effects model and testing procedure for Group-level FDR control (REG-FDR), assumes a model for alternative hypotheses for the eQTL data and controls the FDR by adaptive thresholding. As a convenient alternate approach, we also propose Z-REG-FDR, an approximate version of REG-FDR, that uses only Z-statistics of association between genotype and expression for each gene-SNP pair. The performance of Z-REG-FDR is evaluated using both simulated and real data. Simulations demonstrate that Z-REG-FDR performs similarly to REG-FDR, but with much improved computational speed.
    CONCLUSIONS: Our results demonstrate that the Z-REG-FDR method performs favorably compared to other methods in terms of statistical power and control of FDR. It can be of great practical use for grouped hypothesis testing for eQTL analysis or similar problems in statistical genomics due to its fast computation and ability to be fit using only summary data.

  • Article type: Journal Article
    The false discovery rate (FDR) is a widely used metric of statistical significance for genomic data analyses that involve multiple hypothesis testing. Power and sample size considerations are important in planning studies that perform these types of genomic data analyses. Here, we propose a three-rectangle approximation of a p-value histogram to derive a formula to compute the statistical power and sample size for analyses that involve the FDR. We also introduce the R package FDRsamplesize2, which incorporates these and other power calculation formulas to compute power for a broad variety of studies not covered by other FDR power calculation software. A few illustrative examples are provided. The FDRsamplesize2 package is available on CRAN.

  • Article type: Journal Article
    A growing number of modern scientific problems in areas such as genomics, neurobiology, and spatial epidemiology involve the measurement and analysis of thousands of related features that may be stochastically dependent at arbitrarily strong levels. In this work, we consider the scenario where the features follow a multivariate Normal distribution. We demonstrate that dependence is manifested as random variation shared among features, and that standard methods may yield highly unstable inference due to dependence, even when the dependence is fully parameterized and utilized in the procedure. We propose a "cross-dimensional inference" framework that alleviates the problems due to dependence by modeling and removing the variation shared among features, while also properly regularizing estimation across features. We demonstrate the framework on both simultaneous point estimation and multiple hypothesis testing in scenarios derived from the scientific applications of interest.

  • Article type: Journal Article
    Estimating the false discovery rate (FDR) of peptide identifications is a key step in proteomics data analysis, and many methods have been proposed for this purpose. Recently, an entrapment-inspired protocol to validate methods for FDR estimation appeared in articles showcasing new spectral library search tools. That validation approach involves generating incorrect spectral matches by searching spectra from evolutionarily distant organisms (entrapment queries) against the original target search space. Although this approach may appear similar to the solutions using entrapment databases, it represents a distinct conceptual framework whose correctness has not been verified yet. In this viewpoint, we first discussed the background of the entrapment-based validation protocols and then conducted a few simple computational experiments to verify the assumptions behind them. The results reveal that entrapment databases may, in some implementations, be a reasonable choice for validation, while the assumptions underpinning validation protocols based on entrapment queries are likely to be violated in practice. This article also highlights the need for well-designed frameworks for validating FDR estimation methods in proteomics.

  • Article type: Journal Article
    Assigning statistical confidence estimates to discoveries produced by a tandem mass spectrometry proteomics experiment is critical to enabling principled interpretation of the results and assessing the cost/benefit ratio of experimental follow-up. The most common technique for computing such estimates is to use target-decoy competition (TDC), in which observed spectra are searched against a database of real (target) peptides and a database of shuffled or reversed (decoy) peptides. TDC procedures for estimating the false discovery rate (FDR) at a given score threshold have been developed for application at the level of spectra, peptides, or proteins. Although these techniques are relatively straightforward to implement, it is common in the literature to skip over the implementation details or even to make mistakes in how the TDC procedures are applied in practice. Here we present Crema, an open-source Python tool that implements several TDC methods of spectrum-, peptide- and protein-level FDR estimation. Crema is compatible with a variety of existing database search tools and provides a straightforward way to obtain robust FDR estimates.
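    Target-decoy competition as described above can be sketched in a few lines. This is not Crema's API; the function names, the input layout (one best target score and one best decoy score per spectrum), the tie-breaking rule, and the "+1" decoy correction are assumptions made for this illustration of spectrum-level estimation only.

```python
def compete(target_scores, decoy_scores):
    """Keep, for each spectrum, the better of its target and decoy match.

    Returns a list of (score, is_decoy) pairs. Ties are assigned to the
    decoy, a conservative choice assumed for this sketch.
    """
    winners = []
    for target, decoy in zip(target_scores, decoy_scores):
        if target > decoy:
            winners.append((target, False))
        else:
            winners.append((decoy, True))
    return winners


def estimate_fdr(winners, threshold):
    """Estimate the FDR among target matches scoring >= threshold.

    Uses (decoys + 1) / targets, capped at 1. The +1 correction is a
    commonly used variant and is an assumption here, not a statement
    about any particular tool.
    """
    n_targets = sum(
        1 for score, is_decoy in winners if not is_decoy and score >= threshold
    )
    n_decoys = sum(
        1 for score, is_decoy in winners if is_decoy and score >= threshold
    )
    return min(1.0, (n_decoys + 1) / max(n_targets, 1))


if __name__ == "__main__":
    targets = [9.0, 8.0, 7.0, 3.0, 2.0]
    decoys = [1.0, 2.5, 7.5, 4.0, 1.0]
    winners = compete(targets, decoys)
    for cutoff in (8.0, 5.0):
        print(cutoff, estimate_fdr(winners, cutoff))
```

    With only a handful of matches the +1 term dominates the estimate, which is one reason such estimates are usually computed over many thousands of spectra.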

  • Article type: Journal Article
    This paper explores the exposome concept and its role in elucidating the interplay between environmental exposures and human health. We introduce two key concepts critical for exposomics research. Firstly, we discuss the joint impact of genetics and environment on phenotypes, emphasizing the variance attributable to shared and nonshared environmental factors, underscoring the complexity of quantifying the exposome's influence on health outcomes. Secondly, we introduce the importance of advanced data-driven methods in large cohort studies for exposomic measurements. Here, we introduce the exposome-wide association study (ExWAS), an approach designed for systematic discovery of relationships between phenotypes and various exposures, identifying significant associations while controlling for multiple comparisons. We advocate for the standardized use of the term "exposome-wide association study, ExWAS," to facilitate clear communication and literature retrieval in this field. The paper aims to guide future health researchers in understanding and evaluating exposomic studies. Our discussion extends to emerging topics, such as FAIR Data Principles, biobanked healthcare datasets, and the functional exposome, outlining the future directions in exposomic research. This abstract provides a succinct overview of our comprehensive approach to understanding the complex dynamics of the exposome and its significant implications for human health.