variable selection

  • Article type: Journal Article
    This manuscript summarizes a presentation delivered by the first author at the 2024 symposium for the Calvin Schwabe Award for Lifetime Achievement in Veterinary Epidemiology and Preventive Medicine, which was awarded to Dr. Jan Sargeant. Epidemiologic research plays a crucial role in understanding the complex relationships between exposures and health outcomes. However, the accuracy of the conclusions drawn from these investigations relies upon the meticulous selection and measurement of exposure variables. Appropriate exposure variable selection is crucial for understanding disease etiologies, but it is often the case that we are not able to directly measure the exposure variable of interest and use proxy measures to assess exposures instead. Inappropriate use of proxy measures can lead to erroneous conclusions being made about the true exposure of interest. These errors may lead to biased estimates of associations between exposures and outcomes. The consequences of such biases extend beyond research concerns as health decisions can be made based on flawed evidence. Recognizing and mitigating these biases are essential for producing reliable evidence that informs health policies and interventions, ultimately contributing to improved population health outcomes. To address these challenges, researchers must adopt rigorous methodologies for exposure variable selection and validation studies to minimize measurement errors.






  • Article type: Journal Article
    Neural demyelination and brain damage accumulated in white matter appear as hyperintense areas on T2-weighted MRI scans in the form of lesions. Modeling binary images at the population level, where each voxel represents the existence of a lesion, plays an important role in understanding aging and inflammatory diseases. We propose a scalable hierarchical Bayesian spatial model, called BLESS, capable of handling binary responses by placing continuous spike-and-slab mixture priors on spatially-varying parameters and enforcing spatial dependency on the parameter dictating the amount of sparsity within the probability of inclusion. The use of mean-field variational inference with dynamic posterior exploration, which is an annealing-like strategy that improves optimization, allows our method to scale to large sample sizes. Our method also accounts for underestimation of posterior variance due to variational inference by providing an approximate posterior sampling approach based on Bayesian bootstrap ideas and spike-and-slab priors with random shrinkage targets. Besides accurate uncertainty quantification, this approach is capable of producing novel cluster size based imaging statistics, such as credible intervals of cluster size, and measures of reliability of cluster occurrence. Lastly, we validate our results via simulation studies and an application to the UK Biobank, a large-scale lesion mapping study with a sample size of 40,000 subjects.






  • Article type: Journal Article
    Statistical regression models are used for predicting outcomes based on the values of some predictor variables or for describing the association of an outcome with predictors. With a data set at hand, a regression model can be easily fit with standard software packages. This bears the risk that data analysts may rush to perform sophisticated analyses without sufficient knowledge of basic properties, associations in and errors of their data, leading to wrong interpretation and presentation of the modeling results that lacks clarity. Ignorance about special features of the data such as redundancies or particular distributions may even invalidate the chosen analysis strategy. Initial data analysis (IDA) is prerequisite to regression analyses as it provides knowledge about the data needed to confirm the appropriateness of or to refine a chosen model building strategy, to interpret the modeling results correctly, and to guide the presentation of modeling results. In order to facilitate reproducibility, IDA needs to be preplanned, an IDA plan should be included in the general statistical analysis plan of a research project, and results should be well documented. Biased statistical inference of the final regression model can be minimized if IDA abstains from evaluating associations of outcome and predictors, a key principle of IDA. We give advice on which aspects to consider in an IDA plan for data screening in the context of regression modeling to supplement the statistical analysis plan. We illustrate this IDA plan for data screening in an example of a typical diagnostic modeling project and give recommendations for data visualizations.






  • Article type: Journal Article
    This article introduces a causal discovery method to learn nonlinear relationships in a directed acyclic graph with correlated Gaussian errors due to confounding. First, we derive model identifiability under the sublinear growth assumption. Then, we propose a novel method, named the Deconfounded Functional Structure Estimation (DeFuSE), consisting of a deconfounding adjustment to remove the confounding effects and a sequential procedure to estimate the causal order of variables. We implement DeFuSE via feedforward neural networks for scalable computation. Moreover, we establish the consistency of DeFuSE under an assumption called the strong causal minimality. In simulations, DeFuSE compares favorably against state-of-the-art competitors that ignore confounding or nonlinearity. Finally, we demonstrate the utility and effectiveness of the proposed approach with an application to gene regulatory network analysis. The Python implementation is available at






  • Article type: Journal Article
    We consider unsupervised classification by means of a latent multinomial variable which categorizes a scalar response into one of the L components of a mixture model which incorporates scalar and functional covariates. This process can be thought of as a hierarchical model, with the first level modelling a scalar response according to a mixture of parametric distributions and the second level modelling the mixture probabilities by means of a generalized linear model with functional and scalar covariates. The traditional approach of treating functional covariates as vectors not only suffers from the curse of dimensionality, since functional covariates can be measured at very small intervals leading to a highly parametrized model, but also does not take into account the nature of the data. We use basis expansions to reduce the dimensionality and a Bayesian approach for estimating the parameters while providing predictions of the latent classification vector. The method is motivated by two data examples that are not easily handled by existing methods. The first example concerns identifying placebo responders in a clinical trial (normal mixture model) and the second concerns predicting illness in milking cows (zero-inflated Poisson mixture model).
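
The dimension-reduction step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a Fourier basis fitted by ordinary least squares; the abstract does not say which basis the authors use, and the function names (`fourier_basis`, `basis_coefficients`) and the simulated curves are hypothetical. The point is only that a curve observed at 200 time points can be summarized by a handful of basis coefficients, which could then enter a regression model in place of the raw vector.

```python
import numpy as np

def fourier_basis(t, n_harmonics):
    """Design matrix with a constant plus sine/cosine pairs on [0, 1]."""
    columns = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        columns.append(np.sin(2 * np.pi * k * t))
        columns.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(columns)

def basis_coefficients(curves, t, n_harmonics):
    """Least-squares basis coefficients for each curve (one row per subject)."""
    basis = fourier_basis(t, n_harmonics)
    coefs, *_ = np.linalg.lstsq(basis, curves.T, rcond=None)
    return coefs.T, basis

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
# Hypothetical functional covariate: 30 smooth curves observed at 200 time points.
amplitudes = rng.normal(size=(30, 2))
curves = (amplitudes[:, [0]] * np.sin(2 * np.pi * t)
          + amplitudes[:, [1]] * np.cos(4 * np.pi * t)
          + 0.05 * rng.normal(size=(30, 200)))

# Each 200-point curve is reduced to 1 + 2 * 3 = 7 coefficients.
coefs, basis = basis_coefficients(curves, t, n_harmonics=3)
reconstruction = coefs @ basis.T
rms_error = float(np.sqrt(np.mean((reconstruction - curves) ** 2)))
print(coefs.shape, round(rms_error, 3))
```

With three harmonics, each subject contributes 7 coefficients instead of 200 raw measurements, while the reconstruction error stays close to the simulated noise level.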






  • Article type: Journal Article
    Laser-induced breakdown spectroscopy (LIBS) and visible near-infrared spectroscopy (vis-NIRS) are spectroscopic techniques that offer promising alternatives to traditional laboratory methods for the rapid and cost-effective determination of soil properties on a large scale. Despite their individual limitations, combining LIBS and vis-NIRS has been shown to enhance the prediction accuracy for the determination of soil properties compared to single-sensor approaches. In this study, we used a comprehensive Danish national-scale soil dataset encompassing mostly sandy soils collected from various land uses and soil depths to evaluate the performance of LIBS and vis-NIRS, as well as their combined spectra, in predicting soil organic carbon (SOC) and texture. Firstly, partial least squares regression (PLSR) models were developed to correlate both LIBS and vis-NIRS spectra with the reference data. Subsequently, we merged LIBS and vis-NIRS data and developed PLSR models for the combined spectra. Finally, interval partial least squares regression (iPLSR) models were applied to assess the impact of variable selection on prediction accuracy for both LIBS and vis-NIRS. Despite being fundamentally different techniques, LIBS and vis-NIRS displayed comparable prediction performance for the investigated soil properties. LIBS achieved a root mean square error of prediction (RMSEP) of <7% for texture and 0.5% for SOC, while vis-NIRS achieved an RMSEP of <8% for texture and 0.5% for SOC. Combining LIBS and vis-NIRS spectra improved the prediction accuracy by 16% for clay, 6% for silt and sand, and 2% for SOC compared to single-sensor LIBS predictions. On the other hand, vis-NIRS single-sensor predictions were improved by 10% for clay, 17% for silt, 16% for sand, and 4% for SOC. Furthermore, applying iPLSR for variable selection improved prediction accuracy for both LIBS and vis-NIRS. 
    Compared to LIBS PLSR predictions, iPLSR achieved reductions of 27% and 17% in RMSEP for clay and sand prediction, respectively, and an 8% reduction for silt and SOC prediction. Similarly, vis-NIRS iPLSR models demonstrated reductions of 6% and 4% in RMSEP for clay and SOC, respectively, and a 3% reduction for silt and sand. Interestingly, LIBS iPLSR models outperformed combined LIBS-vis-NIRS models in terms of prediction accuracy. Although combining LIBS and vis-NIRS improved the prediction accuracy of texture and SOC, LIBS coupled with variable selection had a greater benefit in terms of prediction accuracy. Future studies should investigate the influence of reference method uncertainty on prediction accuracy.
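
The idea behind interval PLS can be sketched as follows: split the spectral variables into contiguous intervals, fit a PLS model on each interval, and compare cross-validated prediction errors. This is a minimal illustration under stated assumptions, not the study's pipeline: the PLS1 fit uses a plain NIPALS implementation, the data are simulated, and the names `pls1_fit`, `cv_rmse` and `interval_pls`, the number of components, and the number of intervals are all hypothetical choices.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 regression with NIPALS; returns x mean, y mean and coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        norm = np.linalg.norm(w)
        if norm < 1e-12:              # no covariance left to extract
            break
        w = w / norm
        t = Xk @ w                    # scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - c * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

def cv_rmse(X, y, n_components, n_folds=5):
    """K-fold cross-validated RMSE of a PLS1 model."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    squared_errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        x_mean, y_mean, coef = pls1_fit(X[train], y[train], n_components)
        pred = (X[test_idx] - x_mean) @ coef + y_mean
        squared_errors.append((pred - y[test_idx]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))

def interval_pls(X, y, n_intervals, n_components=2):
    """CV RMSE of a PLS model fitted on each contiguous variable interval."""
    intervals = np.array_split(np.arange(X.shape[1]), n_intervals)
    return [(int(iv[0]), int(iv[-1]), cv_rmse(X[:, iv], y, n_components))
            for iv in intervals]

# Simulated "spectra": only variables 20..29 carry information about y.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 40))
y = X[:, 20:30].sum(axis=1) + 0.1 * rng.normal(size=80)

results = interval_pls(X, y, n_intervals=4)
best = min(results, key=lambda r: r[2])
for first, last, rmse in results:
    print(f"variables {first}-{last}: CV RMSE = {rmse:.2f}")
```

In this simulation the interval covering variables 20 to 29 gives the lowest cross-validated error, which mirrors how iPLSR localizes the informative spectral region before the final model is built.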






  • Article type: Journal Article
    Background: The prediction of patients' outcomes is a key component in personalized medicine. Oftentimes, a prediction model is developed using a large number of candidate predictors, called high-dimensional data, including genomic data, lab tests, electronic health records, etc. Variable selection, also called dimension reduction, is a critical step in developing a prediction model using high-dimensional data. Methods: In this paper, we compare the variable selection and prediction performance of popular machine learning (ML) methods with our proposed method. LASSO is a popular ML method that selects variables by imposing an L1-norm penalty on the likelihood. With this approach, LASSO selects features based on the size of regression estimates rather than their statistical significance. As a result, LASSO can miss significant features while it is known to over-select features. Elastic net (EN), another popular ML method, tends to select even more features than LASSO since it uses a combination of L1- and L2-norm penalties that is less strict than an L1-norm penalty. Insignificant features included in a fitted prediction model act like white noise, so the fitted model loses prediction accuracy. Furthermore, for the future use of a fitted prediction model, we have to collect data on all the features included in the model, which is costly and may lower the accuracy of the data if there are too many features. Therefore, we propose an ML method, called repeated sieving, extending the standard regression methods with stepwise variable selection. By selecting features based on their statistical significance, it resolves the over-selection issue with high-dimensional data. Results: Through extensive numerical studies and real data examples, our results show that the repeated sieving method selects far fewer features than LASSO and EN, but has higher prediction accuracy than the existing ML methods.
    Conclusions: We conclude that our repeated sieving method performs well in both variable selection and prediction, and it saves the cost of future investigation on the selected factors.
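
To make the L1-penalty mechanism described in this abstract concrete, here is a minimal sketch of LASSO for a linear model, solved by cyclic coordinate descent with soft-thresholding. It illustrates only the baseline method the abstract discusses, not the authors' repeated sieving procedure, whose details are not given here. The function names, the penalty value `lam=0.5`, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the partial residual that excludes j.
            rho = X[:, j] @ residual / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            residual += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

# Simulated data: 20 candidate predictors, only the first three matter.
rng = np.random.default_rng(2)
n, p = 200, 20
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.5 * rng.normal(size=n)
y = y - y.mean()

beta_hat = lasso_coordinate_descent(X, y, lam=0.5)
selected = np.flatnonzero(beta_hat)
print("selected features:", selected)
print("estimates:", np.round(beta_hat[selected], 2))
```

The selected coefficients are shrunk toward zero relative to their true values, which shows that selection here is driven by the size of the penalized estimates rather than by a significance test. A smaller `lam` would typically admit more noise features.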






  • Article type: Journal Article
    With the development of machine learning and artificial intelligence (ML/AI) models, data-driven soft sensors, especially neural network-based ones, are widely used to predict key water quality indicators in wastewater treatment plants (WWTPs). However, recent research indicates that prediction performance and computational efficiency are greatly compromised by the time-varying, nonlinear and high-dimensional nature of the wastewater treatment process. This paper proposes a neural network-based soft sensor with double-errors parallel optimization to achieve more accurate and timely prediction of effluent variables. Firstly, relying on the Activity Based Classification (ABC) principle, an ensemble variable selection method that combines the Pearson correlation coefficient (PCC) and mutual information (MI) is introduced to select the optimal process variables as auxiliary variables, thereby reducing the data dimensionality and the model complexity. Subsequently, a double-errors parallel optimization methodology that minimizes both point prediction error and distribution error simultaneously is proposed, aiming to enhance the training efficiency and the fitting quality of neural networks. Finally, the effectiveness is quantitatively assessed on two datasets collected from the Benchmark Simulation Model no. 1 (BSM1) and an actual oxidation ditch WWTP. The experimental results illustrate that the proposed soft sensor achieves precise effluent variable prediction, with RMSE, MAE and R2 values of 0.0606, 0.0486 and 0.99930, and 0.06939, 0.05381 and 0.98040, for the two datasets respectively. Consequently, this soft sensor can expedite convergence in the neural network training process and enhance the prediction performance, thereby contributing to the effective optimization management of WWTPs.
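
One way to combine the Pearson correlation coefficient and mutual information into a single variable ranking is sketched below. This is a hedged illustration only: the abstract does not specify how the two criteria are merged or how the ABC principle enters, so the rank-averaging rule, the histogram-based MI estimator, the function names, and the simulated data are all assumptions.

```python
import numpy as np

def abs_pearson(x, y):
    """Absolute Pearson correlation coefficient between two 1-D arrays."""
    return abs(np.corrcoef(x, y)[0, 1])

def binned_mutual_information(x, y, bins=8):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def ensemble_rank(X, y):
    """Average of the PCC-based and MI-based ranks (lower = more relevant)."""
    n_vars = X.shape[1]
    pcc = np.array([abs_pearson(X[:, j], y) for j in range(n_vars)])
    mi = np.array([binned_mutual_information(X[:, j], y) for j in range(n_vars)])

    def to_rank(scores):
        # Rank 0 goes to the largest score.
        return np.argsort(np.argsort(-scores))

    return (to_rank(pcc) + to_rank(mi)) / 2.0, pcc, mi

# Simulated process data: variable 0 acts linearly, variable 1 nonlinearly,
# variables 2-4 are pure noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

scores, pcc, mi = ensemble_rank(X, y)
order = np.argsort(scores, kind="stable")
print("variables ordered by ensemble rank:", order)
print("PCC:", np.round(pcc, 2))
print("MI :", np.round(mi, 2))
```

The quadratic effect of variable 1 is nearly invisible to the correlation coefficient but shows up in the mutual information, which is the usual motivation for combining a linear and a nonlinear relevance measure before choosing auxiliary variables.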






  • Article type: Journal Article
    Integrative analysis has emerged as a prominent tool in biomedical research, offering a solution to the "small n and large p" challenge. Leveraging the powerful capabilities of deep learning in extracting complex relationships between genes and diseases, our objective in this study is to incorporate deep learning into the framework of integrative analysis. Recognizing the redundancy within candidate features, we introduce a dedicated feature selection layer in the proposed integrative deep learning method. To further improve the performance of feature selection, the rich body of previous research is utilized by an ensemble learning method to identify "prior information". This leads to the proposed prior assisted integrative deep learning (PANDA) method. We demonstrate the superiority of the PANDA method through a series of simulation studies, showing its clear advantages over competing approaches in both feature selection and outcome prediction. Finally, a skin cutaneous melanoma (SKCM) dataset is extensively analyzed by the PANDA method to show its practical application.






  • Article type: Journal Article
    Bayesian graphical models are powerful tools to infer complex relationships in high dimension, yet are often fraught with computational and statistical challenges. If exploited in a principled way, the increasing information collected alongside the data of primary interest constitutes an opportunity to mitigate these difficulties by guiding the detection of dependence structures. For instance, gene network inference may be informed by the use of publicly available summary statistics on the regulation of genes by genetic variants. Here we present a novel Gaussian graphical modeling framework to identify and leverage information on the centrality of nodes in conditional independence graphs. Specifically, we consider a fully joint hierarchical model to simultaneously infer (i) sparse precision matrices and (ii) the relevance of node-level information for uncovering the sought-after network structure. We encode such information as candidate auxiliary variables using a spike-and-slab submodel on the propensity of nodes to be hubs, which allows hypothesis-free selection and interpretation of a sparse subset of relevant variables. As efficient exploration of large posterior spaces is needed for real-world applications, we develop a variational expectation conditional maximization algorithm that scales inference to hundreds of samples, nodes and auxiliary variables. We illustrate and exploit the advantages of our approach in simulations and in a gene network study which identifies hub genes involved in biological pathways relevant to immune-mediated diseases.





