false discovery rate

  • Article type: Journal Article
    Covariate selection when the number of available variables is large relative to the number of observations is problematic in epidemiology and remains the focus of continued research. Whilst a variety of statistical methods have been developed to attempt to overcome this issue, at present very few methods are available for wide data that include a clustered outcome. The purpose of this research was to make an empirical evaluation of a new method for covariate selection in wide data settings when the dependent variable is clustered. We used 3300 simulated datasets with a variety of defined structures and known sets of true predictor variables to conduct an empirical evaluation of a mixed model stability selection procedure. Comparison was made with an alternative method based on regularisation using the least absolute shrinkage and selection operator (Lasso) penalty. Model performance was assessed using several metrics including the true positive rate (proportion of true covariates selected in a final model) and false discovery rate (proportion of variables selected in a final model that were non-true (false) variables). For stability selection, the false discovery rate was consistently low, generally remaining ≤ 0.02, indicating that on average fewer than 1 in 50 of the variables selected in a final model were false variables. This was in contrast to the Lasso-based method, in which the false discovery rate was between 0.59 and 0.72, indicating that generally more than 60% of variables selected in a final model were false variables. However, the Lasso method attained higher true positive rates than stability selection, although both methods achieved good results. For the Lasso method, true positive rates remained ≥ 0.93, whereas for stability selection the true positive rate was 0.73-0.97. Our results suggest both methods may be of value for covariate selection with high dimensional data with a clustered outcome. When high specificity is needed for identification of true covariates, stability selection appeared to offer the better solution, although with a slight loss of sensitivity. Conversely, when high sensitivity is needed, the Lasso approach may be useful, even if accompanied by a substantial loss of specificity. Overall, the results indicated that the loss of sensitivity when employing stability selection is relatively small compared to the loss of specificity when using the Lasso, and therefore stability selection may provide the better option for the analyst when evaluating data of this type.
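
    The true positive rate and false discovery rate, as defined in the abstract above, can be illustrated with a minimal sketch. The function name `selection_metrics` and the variable names in the example are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def selection_metrics(selected, true_covariates):
    """Return (true positive rate, false discovery rate) for one fitted model."""
    selected = set(selected)
    true_covariates = set(true_covariates)
    true_hits = selected & true_covariates

    # TPR: proportion of the true covariates that ended up in the final model.
    if true_covariates:
        tpr = len(true_hits) / len(true_covariates)
    else:
        tpr = float("nan")

    # FDR: proportion of the selected variables that are not true covariates.
    # By convention the FDR is taken as 0 when nothing is selected.
    if selected:
        fdr = (len(selected) - len(true_hits)) / len(selected)
    else:
        fdr = 0.0
    return tpr, fdr


# Hypothetical example: four true covariates, five selected, three of them true.
tpr, fdr = selection_metrics(
    selected=["x1", "x2", "x3", "x8", "x9"],
    true_covariates=["x1", "x2", "x3", "x4"],
)
print(tpr, fdr)
```

    In a simulation study of this kind, such metrics would typically be averaged over all simulated datasets for each method.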

  • Article type: Journal Article
    Recent approaches to the statistical analysis of adverse event (AE) data in clinical trials have proposed the use of groupings of related AEs, such as by system organ class (SOC). These methods have opened up the possibility of scanning large numbers of AEs while controlling for multiple comparisons, making the comparative performance of the different methods in terms of AE detection and error rates of interest to investigators. We apply two Bayesian models and two procedures for controlling the false discovery rate (FDR), which use groupings of AEs, to real clinical trial safety data. We find that while the Bayesian models are appropriate for the full data set, the error controlling methods only give similar results to the Bayesian methods when low incidence AEs are removed. A simulation study is used to compare the relative performances of the methods. We investigate the differences between the methods over full trial data sets, and over data sets with low incidence AEs and SOCs removed. We find that while the removal of low incidence AEs increases the power of the error controlling procedures, the estimated power of the Bayesian methods remains relatively constant over all data sizes. However, automatic removal of low incidence AEs does affect the error rates of all the methods, and a clinically guided approach to their removal is needed. Overall, we found that the Bayesian approaches are particularly useful for scanning the large amounts of AE data gathered.

  • Article type: Journal Article
    Knockoffs provide a general framework for controlling the false discovery rate when performing variable selection. Much of the knockoffs literature focuses on theoretical challenges, and we recognize a need for bringing some of the current ideas into practice. In this paper we propose a sequential algorithm for generating knockoffs when the underlying data consist of both continuous and categorical (factor) variables. Further, we present a heuristic multiple knockoffs approach that offers a practical assessment of how robust the knockoff selection process is for a given dataset. We conduct extensive simulations to validate the performance of the proposed methodology. Finally, we demonstrate the utility of the methods on a large clinical data pool of more than 2000 patients with psoriatic arthritis evaluated in four clinical trials with an IL-17A inhibitor, secukinumab (Cosentyx), where we determine prognostic factors of a well-established clinical outcome. The analyses presented in this paper could provide a wide range of applications to commonly encountered datasets in medical practice and other fields where variable selection is of particular interest.

  • Article type: Journal Article
    Reproducibility of research findings has been recently questioned in many fields of science, including psychology and neurosciences. One factor influencing reproducibility is the simultaneous testing of multiple hypotheses, which entails false positive findings unless the analyzed p-values are carefully corrected. While this multiple testing problem is well known and studied, it continues to be both a theoretical and practical problem.
    Here we assess reproducibility in simulated experiments in the context of multiple testing. We consider methods that control either the family-wise error rate (FWER) or false discovery rate (FDR), including techniques based on random field theory (RFT), cluster-mass based permutation testing, and adaptive FDR. Several classical methods are also considered. The performance of these methods is investigated under two different models.
    We found that permutation testing is the most powerful method among the considered approaches to multiple testing, and that grouping hypotheses based on prior knowledge can improve power. We also found that emphasizing primary and follow-up studies equally produced the most reproducible outcomes.
    We have extended the use of two-group and separate-classes models for analyzing reproducibility and provide a new open-source software package, "MultiPy", for multiple hypothesis testing.
    Our simulations suggest that performing strict corrections for multiple testing is not sufficient to improve reproducibility of neuroimaging experiments. The methods are freely available as a Python toolkit, "MultiPy", and we intend this study to help improve statistical data analysis practices and to assist in conducting power and reproducibility analyses for new experiments.
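
    The difference between family-wise error rate (FWER) control and FDR control mentioned in this abstract can be illustrated with a minimal sketch. It uses the Bonferroni correction and the Benjamini-Hochberg step-up procedure as standard textbook examples of the two kinds of control; the abstract does not name these two procedures, and the code below is plain Python written for illustration, not the MultiPy API. The p-values are made up.

```python
def bonferroni(pvals, alpha=0.05):
    """FWER control: reject H_i when p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: find the largest rank k with p_(k) <= k * alpha / m,
    then reject the k hypotheses with the smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Made-up p-values for eight hypotheses.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
n_bonferroni = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(n_bonferroni, n_bh)
```

    On these values the FDR-controlling procedure rejects more hypotheses than the FWER-controlling one, which is the usual trade-off: more power in exchange for a weaker error guarantee.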

  • Article type: Journal Article
    Phase II clinical studies represent a critical point in determining drug costs, and phase II is a poor predictor of drug success: >30% of drugs entering phase II studies fail to progress, and >58% of drugs go on to fail in phase III. Adaptive clinical trial design has been proposed as a way to reduce the costs of phase II testing by providing earlier determination of futility and prediction of phase III success, reducing overall phase II and III trial sizes, and shortening overall drug development time. This review examines issues in phase II testing and adaptive trial design.

  • Article type: Journal Article
    This note complements and clarifies part of the work of Hawinkel et al. recently published in the journal and suggests some more or less standard tools and methods for carrying out association studies of the microbiome.

  • Article type: Journal Article
    This report presents the results from the 2016 Association of Biomolecular Resource Facilities Proteome Informatics Research Group (iPRG) study on proteoform inference and false discovery rate (FDR) estimation from bottom-up proteomics data. For this study, 3 replicate Q Exactive Orbitrap liquid chromatography-tandem mass spectrometry datasets were generated from each of 4 Escherichia coli samples spiked with different equimolar mixtures of small recombinant proteins selected to mimic pairs of homologous proteins. Participants were given raw data and a sequence file and asked to identify the proteins and provide estimates of the FDR at the proteoform level. As part of this study, we tested a new submission system with a format validator running on a virtual private server (VPS) and allowed methods to be provided as executable R Markdown or IPython Notebooks. The task was perceived as difficult, and only eight unique submissions were received, although those who participated did well, with no one method performing best on all samples. However, none of the submissions included a complete Markdown or Notebook, even though examples were provided. Future iPRG studies need to be more successful in promoting and encouraging participation. The VPS and submission validator easily scale to much larger numbers of participants in these types of studies. The unique "ground-truth" dataset for proteoform identification generated for this study is now available to the research community, as are the server-side scripts for validating and managing submissions.

  • Article type: Journal Article
    An environment-wide association study (EWAS) may be useful to comprehensively test and validate associations between environmental factors and cardiovascular disease (CVD) in an unbiased manner.
    Data from the National Health and Nutrition Examination Survey (1999-2014) were randomly split 50:50 into a training set and a testing set. CVD was ascertained by a self-reported diagnosis of myocardial infarction, coronary heart disease or stroke. We performed multiple linear regression analyses associating 203 environmental factors and 132 clinical phenotypes with CVD in the training set (false discovery rate < 5%), and significant factors were validated in the testing set (P < 0.05). A random forest (RF) model was used for multicollinearity elimination and variable importance ranking. The discriminative power of factors for CVD was calculated by the area under the receiver operating characteristic curve (AUROC). Overall, 43,568 participants, of whom 4084 (9.4%) had CVD, were included. After adjusting for age, sex, race, body mass index, blood pressure and socio-economic level, we identified 5 environmental variables and 19 clinical phenotypes associated with CVD in both the training and testing datasets. The top five factors in the RF importance ranking were waist, glucose, uric acid, red cell distribution width, and glycated hemoglobin. The AUROC of the RF model was 0.816 (top 5 factors) and 0.819 (full model). Sensitivity analyses revealed no specific moderators of the associations.
    Our systematic evaluation provides new knowledge on the complex array of environmental correlates of CVD. These identified correlates may serve as a complementary approach to CVD risk assessment. Our findings need to be probed in further observational and interventional studies.

  • Article type: Journal Article
    In the Chromosome-Centric Human Proteome Project (C-HPP), false-positive identification by peptide spectrum matches (PSMs) after database searches is a major issue for proteogenomic studies using liquid-chromatography and mass-spectrometry-based large proteomic profiling. Here we developed a simple strategy for protein identification, with a controlled false discovery rate (FDR) at the protein level, using an integrated proteomic pipeline (IPP) that consists of four engrailed steps as follows. First, using three different search engines, SEQUEST, MASCOT, and MS-GF+, individual proteomic searches were performed against the neXtProt database. Second, the search results from the PSMs were combined using statistical evaluation tools including DTASelect and Percolator. Third, the peptide search scores were converted into E-scores normalized using an in-house program. Last, ProteinInferencer was used to filter the proteins containing two or more peptides with a controlled FDR of 1.0% at the protein level. Finally, we compared the performance of the IPP to a conventional proteomic pipeline (CPP) for protein identification using a controlled FDR of <1% at the protein level. Using the IPP, a total of 5756 proteins (vs 4453 using the CPP) including 477 alternative splicing variants (vs 182 using the CPP) were identified from human hippocampal tissue. In addition, a total of 10 missing proteins (vs 7 using the CPP) were identified with two or more unique peptides, and their tryptic peptides were validated using MS/MS spectral pattern from a repository database or their corresponding synthetic peptides. This study shows that the IPP effectively improved the identification of proteins, including alternative splicing variants and missing proteins, in human hippocampal tissues for the C-HPP. All RAW files used in this study were deposited in ProteomeXchange (PXD000395).

  • Article type: Journal Article
    In genome-wide studies, hundreds of thousands of hypothesis tests are performed simultaneously. Bonferroni correction and False Discovery Rate (FDR) procedures can effectively control type I error but often yield a high false negative rate. We aim to develop a more powerful method to detect differentially expressed genes. We present a Weighted False Discovery Rate (WFDR) method that incorporates biological knowledge from genetic networks. We first identify weights using Integrative Multi-species Prediction (IMP) and then apply the weights in WFDR to identify differentially expressed genes through an IMP-WFDR algorithm. We performed a gene expression experiment to identify zebrafish genes that change expression in the presence of arsenic during a systemic Pseudomonas aeruginosa infection. Zebrafish were exposed to arsenic at 10 parts per billion and/or infected with P. aeruginosa. Appropriate controls were included. We then applied IMP-WFDR during the analysis of differentially expressed genes. We compared the mRNA expression for each group and found over 200 differentially expressed genes and several enriched pathways including defense response pathways, arsenic response pathways, and the Notch signaling pathway.
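
    The idea of weighting hypotheses by prior knowledge can be sketched as follows. This is an assumption-laden illustration, not the IMP-WFDR algorithm: the abstract does not give the formula, so the sketch uses one common weighted formulation, in which each p-value is divided by a weight (normalised to mean 1) before a Benjamini-Hochberg step-up is applied. The function name `weighted_bh`, the p-values and the weights are all hypothetical; in the study the weights would come from IMP.

```python
def weighted_bh(pvals, weights, alpha=0.05):
    """Weighted step-up sketch: scale each p-value by 1 / weight
    (weights normalised to mean 1), then apply Benjamini-Hochberg."""
    m = len(pvals)
    if m != len(weights):
        raise ValueError("pvals and weights must have the same length")
    if m == 0:
        return []
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    mean_w = sum(weights) / m
    norm = [w / mean_w for w in weights]  # normalised weights average to 1

    # A larger weight shrinks the p-value; a zero weight means never reject.
    adj = [min(1.0, p / w) if w > 0 else 1.0 for p, w in zip(pvals, norm)]

    # Benjamini-Hochberg step-up on the weighted p-values.
    order = sorted(range(m), key=lambda i: adj[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if adj[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values; the first two genes get higher prior weight.
pvals = [0.02, 0.03, 0.2, 0.6]
unweighted = weighted_bh(pvals, [1, 1, 1, 1])
weighted = weighted_bh(pvals, [3, 3, 1, 1])
print(unweighted, weighted)
```

    With equal weights nothing is rejected in this example, whereas up-weighting the first two hypotheses lets both be rejected, which shows how informative weights can raise power at the same nominal FDR level.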