false discovery rate

  • Article type: Journal Article
    Covariate selection when the number of available variables is large relative to the number of observations is problematic in epidemiology and remains the focus of continued research. Whilst a variety of statistical methods have been developed to attempt to overcome this issue, at present very few methods are available for wide data that include a clustered outcome. The purpose of this research was to make an empirical evaluation of a new method for covariate selection in wide data settings when the dependent variable is clustered. We used 3300 simulated datasets with a variety of defined structures and known sets of true predictor variables to conduct an empirical evaluation of a mixed model stability selection procedure. Comparison was made with an alternative method based on regularisation using the least absolute shrinkage and selection operator (Lasso) penalty. Model performance was assessed using several metrics including the true positive rate (proportion of true covariates selected in a final model) and false discovery rate (proportion of variables selected in a final model that were non-true (false) variables). For stability selection, the false discovery rate was consistently low, generally remaining ≤ 0.02 indicating that on average fewer than 1 in 50 of the variables selected in a final model were false variables. This was in contrast to the Lasso-based method in which the false discovery rate was between 0.59 and 0.72, indicating that generally more than 60% of variables selected in a final model were false variables. In contrast however, the Lasso method attained higher true positive rates than stability selection, although both methods achieved good results. For the Lasso method, true positive rates remained ≥ 0.93 whereas for stability selection the true positive rate was 0.73-0.97. Our results suggest both methods may be of value for covariate selection with high dimensional data with a clustered outcome. 
    When high specificity is needed for identification of true covariates, stability selection appeared to offer the better solution, although with a slight loss of sensitivity. Conversely, when high sensitivity is needed, the Lasso approach may be useful, even if accompanied by a substantial loss of specificity. Overall, the results indicated that the loss of sensitivity when employing stability selection is relatively small compared to the loss of specificity when using the Lasso, and therefore stability selection may provide the better option for the analyst when evaluating data of this type.
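    The two headline metrics defined in this abstract can be computed directly from a selected variable set and the known set of true predictors. The following is a minimal sketch; the function name and the example variable names are hypothetical and are not taken from the study.

```python
def selection_metrics(selected, true_covariates):
    """Return (true positive rate, false discovery rate) for one final model."""
    selected = set(selected)
    true_covariates = set(true_covariates)
    true_hits = selected & true_covariates
    # TPR: proportion of the true covariates that were selected.
    tpr = len(true_hits) / len(true_covariates) if true_covariates else float("nan")
    # FDR: proportion of the selected variables that are not true covariates.
    # By convention an empty selection has an FDR of 0.
    fdr = len(selected - true_covariates) / len(selected) if selected else 0.0
    return tpr, fdr


# Illustrative input only: four variables selected, three true covariates.
tpr, fdr = selection_metrics(["x1", "x2", "x7", "x9"], ["x1", "x2", "x3"])
print(tpr, fdr)
```

    In a simulation study such as the one described, these values would be averaged over the simulated datasets for each method.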






  • Article type: Journal Article
    Recent approaches to the statistical analysis of adverse event (AE) data in clinical trials have proposed the use of groupings of related AEs, such as by system organ class (SOC). These methods have opened up the possibility of scanning large numbers of AEs while controlling for multiple comparisons, making the comparative performance of the different methods in terms of AE detection and error rates of interest to investigators. We apply two Bayesian models and two procedures for controlling the false discovery rate (FDR), which use groupings of AEs, to real clinical trial safety data. We find that while the Bayesian models are appropriate for the full data set, the error controlling methods only give similar results to the Bayesian methods when low incidence AEs are removed. A simulation study is used to compare the relative performances of the methods. We investigate the differences between the methods over full trial data sets, and over data sets with low incidence AEs and SOCs removed. We find that while the removal of low incidence AEs increases the power of the error controlling procedures, the estimated power of the Bayesian methods remains relatively constant over all data sizes. Automatic removal of low-incidence AEs however does have an effect on the error rates of all the methods, and a clinically guided approach to their removal is needed. Overall we found that the Bayesian approaches are particularly useful for scanning the large amounts of AE data gathered.
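    For orientation, the basic (ungrouped) Benjamini-Hochberg step-up procedure for FDR control can be sketched as follows. This is not one of the grouped procedures evaluated in the abstract, which additionally exploit the SOC structure; the function name and the p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Illustrative p-values, e.g. one per adverse event type.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p, q=0.05)
print(flags)
```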






  • Article type: Journal Article
    Knockoffs provide a general framework for controlling the false discovery rate when performing variable selection. Much of the Knockoffs literature focuses on theoretical challenges and we recognize a need for bringing some of the current ideas into practice. In this paper we propose a sequential algorithm for generating knockoffs when underlying data consists of both continuous and categorical (factor) variables. Further, we present a heuristic multiple knockoffs approach that offers a practical assessment of how robust the knockoff selection process is for a given dataset. We conduct extensive simulations to validate performance of the proposed methodology. Finally, we demonstrate the utility of the methods on a large clinical data pool of more than 2000 patients with psoriatic arthritis evaluated in four clinical trials with an IL-17A inhibitor, secukinumab (Cosentyx), where we determine prognostic factors of a well established clinical outcome. The analyses presented in this paper could provide a wide range of applications to commonly encountered datasets in medical practice and other fields where variable selection is of particular interest.
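    The selection step of the knockoff filter can be illustrated with a short sketch. It assumes that a vector of feature statistics W has already been computed, where a large positive W_j indicates that the original variable looks more important than its knockoff copy. Generating the knockoffs, including the sequential algorithm for mixed data proposed in the paper, is not shown. The function names and the example values are hypothetical.

```python
def knockoff_threshold(w, q=0.1, offset=1):
    """Data-dependent threshold of the knockoff filter.

    offset=1 corresponds to the more conservative 'knockoff+' variant.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = sorted({abs(v) for v in w if v != 0})
    for t in candidates:
        negatives = sum(1 for v in w if v <= -t)
        positives = sum(1 for v in w if v >= t)
        # Estimated false discovery proportion at threshold t.
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")  # no threshold meets the target: select nothing


def knockoff_select(w, q=0.1):
    """Return the indices of the variables selected at target FDR q."""
    t = knockoff_threshold(w, q)
    return [j for j, v in enumerate(w) if v >= t]


# Illustrative statistics only.
w = [5.0, 4.2, 3.9, 3.1, 2.8, 2.5, -0.4, 0.3, -0.2, 0.1, 2.2, 1.9]
print(knockoff_threshold(w, q=0.2), knockoff_select(w, q=0.2))
```

    Because knockoff generation is random, repeating the whole procedure with several independent knockoff draws and comparing the selected sets is one way to gauge how stable a selection is, which is the motivation for the multiple knockoffs heuristic mentioned above.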






  • Article type: Journal Article
    Reproducibility of research findings has been recently questioned in many fields of science, including psychology and neurosciences. One factor influencing reproducibility is the simultaneous testing of multiple hypotheses, which entails false positive findings unless the analyzed p-values are carefully corrected. While this multiple testing problem is well known and studied, it continues to be both a theoretical and practical problem.
    Here we assess reproducibility in simulated experiments in the context of multiple testing. We consider methods that control either the family-wise error rate (FWER) or false discovery rate (FDR), including techniques based on random field theory (RFT), cluster-mass based permutation testing, and adaptive FDR. Several classical methods are also considered. The performance of these methods is investigated under two different models.
    We found that permutation testing is the most powerful method among the considered approaches to multiple testing, and that grouping hypotheses based on prior knowledge can improve power. We also found that emphasizing primary and follow-up studies equally produced the most reproducible outcomes.
    We have extended the use of two-group and separate-classes models for analyzing reproducibility and provide new open-source software, "MultiPy", for multiple hypothesis testing.
    Our simulations suggest that performing strict corrections for multiple testing is not sufficient to improve reproducibility of neuroimaging experiments. The methods are freely available as a Python toolkit, "MultiPy", and we intend this study to help improve statistical data analysis practices and to assist in conducting power and reproducibility analyses for new experiments.
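    As an illustration of the classical FWER-controlling methods referred to in this abstract, the Holm step-down procedure can be sketched in a few lines. This is a standalone sketch and does not use or reproduce the MultiPy API; the function name and the p-values are assumptions made for the example.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down procedure: controls the family-wise error rate at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order):
        # The (step + 1)-th smallest p-value is compared with alpha / (m - step).
        if p_values[idx] <= alpha / (m - step):
            rejected[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return rejected


# Illustrative p-values only.
p = [0.01, 0.04, 0.03, 0.005]
print(holm(p, alpha=0.05))
```

    Holm rejects at least as many hypotheses as the plain Bonferroni correction at the same level, but FDR-controlling methods are typically less conservative still.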






  • Article type: Journal Article
    Phase II clinical studies represent a critical point in determining drug costs, and phase II is a poor predictor of drug success: >30% of drugs entering phase II studies fail to progress, and >58% of drugs go on to fail in phase III. Adaptive clinical trial design has been proposed as a way to reduce the costs of phase II testing by providing earlier determination of futility and prediction of phase III success, reducing overall phase II and III trial sizes, and shortening overall drug development time. This review examines issues in phase II testing and adaptive trial design.







  • Article type: Journal Article
    This note complements and clarifies part of the work of Hawinkel et al. recently published in the journal and suggests some more or less standard tools and methods for carrying out association studies of the microbiome.






  • Article type: Journal Article
    This report presents the results from the 2016 Association of Biomolecular Resource Facilities Proteome Informatics Research Group (iPRG) study on proteoform inference and false discovery rate (FDR) estimation from bottom-up proteomics data. For this study, 3 replicate Q Exactive Orbitrap liquid chromatography-tandem mass spectrometry datasets were generated from each of 4 Escherichia coli samples spiked with different equimolar mixtures of small recombinant proteins selected to mimic pairs of homologous proteins. Participants were given raw data and a sequence file and asked to identify the proteins and provide estimates on the FDR at the proteoform level. As part of this study, we tested a new submission system with a format validator running on a virtual private server (VPS) and allowed methods to be provided as executable R Markdown or IPython Notebooks. The task was perceived as difficult, and only eight unique submissions were received, although those who participated did well, with no one method performing best on all samples. However, none of the submissions included a complete Markdown or Notebook, even though examples were provided. Future iPRG studies need to be more successful in promoting and encouraging participation. The VPS and submission validator easily scale to much larger numbers of participants in these types of studies. The unique "ground-truth" dataset for proteoform identification generated for this study is now available to the research community, as are the server-side scripts for validating and managing submissions.







  • Article type: Journal Article
    An environment-wide association study (EWAS) may be useful to comprehensively test and validate associations between environmental factors and cardiovascular disease (CVD) in an unbiased manner.
    Data from the National Health and Nutrition Examination Survey (1999-2014) were randomly split 50:50 into a training set and a testing set. CVD was ascertained by a self-reported diagnosis of myocardial infarction, coronary heart disease or stroke. We performed multiple linear regression analyses associating 203 environmental factors and 132 clinical phenotypes with CVD in the training set (false discovery rate < 5%), and significant factors were validated in the testing set (P < 0.05). A random forest (RF) model was used for multicollinearity elimination and variable importance ranking. Discriminative power of factors for CVD was calculated by the area under the receiver operating characteristic curve (AUROC). Overall, 43,568 participants were included, of whom 4084 (9.4%) had CVD. After adjusting for age, sex, race, body mass index, blood pressure and socio-economic level, we identified 5 environmental variables and 19 clinical phenotypes associated with CVD in both the training and testing datasets. The top five factors in the RF importance ranking were waist, glucose, uric acid, red cell distribution width and glycated hemoglobin. The AUROC of the RF model was 0.816 (top 5 factors) and 0.819 (full model). Sensitivity analyses revealed no specific moderators of the associations.
    Our systematic evaluation provides new knowledge on the complex array of environmental correlates of CVD. These identified correlates may serve as a complementary approach to CVD risk assessment. Our findings need to be probed in further observational and interventional studies.






  • Article type: Journal Article
    In the Chromosome-Centric Human Proteome Project (C-HPP), false-positive identification by peptide spectrum matches (PSMs) after database searches is a major issue for proteogenomic studies using liquid-chromatography and mass-spectrometry-based large proteomic profiling. Here we developed a simple strategy for protein identification, with a controlled false discovery rate (FDR) at the protein level, using an integrated proteomic pipeline (IPP) that consists of four engrailed steps as follows. First, using three different search engines, SEQUEST, MASCOT, and MS-GF+, individual proteomic searches were performed against the neXtProt database. Second, the search results from the PSMs were combined using statistical evaluation tools including DTASelect and Percolator. Third, the peptide search scores were converted into E-scores normalized using an in-house program. Last, ProteinInferencer was used to filter the proteins containing two or more peptides with a controlled FDR of 1.0% at the protein level. Finally, we compared the performance of the IPP to a conventional proteomic pipeline (CPP) for protein identification using a controlled FDR of <1% at the protein level. Using the IPP, a total of 5756 proteins (vs 4453 using the CPP) including 477 alternative splicing variants (vs 182 using the CPP) were identified from human hippocampal tissue. In addition, a total of 10 missing proteins (vs 7 using the CPP) were identified with two or more unique peptides, and their tryptic peptides were validated using MS/MS spectral pattern from a repository database or their corresponding synthetic peptides. This study shows that the IPP effectively improved the identification of proteins, including alternative splicing variants and missing proteins, in human hippocampal tissues for the C-HPP. All RAW files used in this study were deposited in ProteomeXchange (PXD000395).






  • Article type: Journal Article
    In genome-wide studies, hundreds of thousands of hypothesis tests are performed simultaneously. Bonferroni correction and False Discovery Rate (FDR) can effectively control type I error but often yield a high false negative rate. We aim to develop a more powerful method to detect differentially expressed genes. We present a Weighted False Discovery Rate (WFDR) method that incorporates biological knowledge from genetic networks. We first identify weights using Integrative Multi-species Prediction (IMP) and then apply the weights in WFDR to identify differentially expressed genes through an IMP-WFDR algorithm. We performed a gene expression experiment to identify zebrafish genes that change expression in the presence of arsenic during a systemic Pseudomonas aeruginosa infection. Zebrafish were exposed to arsenic at 10 parts per billion and/or infected with P. aeruginosa. Appropriate controls were included. We then applied IMP-WFDR during the analysis of differentially expressed genes. We compared the mRNA expression for each group and found over 200 differentially expressed genes and several enriched pathways including defense response pathways, arsenic response pathways, and the Notch signaling pathway.
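    The abstract does not spell out how the weights enter the procedure. One commonly used form of weighted FDR control is the weighted Benjamini-Hochberg procedure, in which each p-value is divided by a prior weight (with weights averaging 1) before the usual step-up rule is applied. The sketch below shows that generic variant under the assumption that positive weights are already available; it is not the IMP-WFDR algorithm itself, and the function name and the numbers are illustrative.

```python
def weighted_bh(p_values, weights, q=0.05):
    """Weighted Benjamini-Hochberg: larger weight means stronger prior support."""
    m = len(p_values)
    mean_w = sum(weights) / m
    # Rescale the weights to average 1, then divide each p-value by its weight.
    adjusted = [min(1.0, p * mean_w / w) for p, w in zip(p_values, weights)]
    order = sorted(range(m), key=lambda i: adjusted[i])
    # Standard step-up rule applied to the weighted p-values.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if adjusted[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Illustrative values: the first gene has strong prior (e.g. network) support.
p = [0.02, 0.3, 0.5, 0.9]
weighted = weighted_bh(p, [2.5, 0.5, 0.5, 0.5], q=0.05)
unweighted = weighted_bh(p, [1.0, 1.0, 1.0, 1.0], q=0.05)
print(weighted, unweighted)
```

    In this toy example the up-weighted gene is rejected, whereas with equal weights (plain Benjamini-Hochberg) nothing is. The gain in power comes at the cost of down-weighting the remaining hypotheses.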





