linear regression

  • Article type: Journal Article
    BACKGROUND: Accurately estimating surgical case duration is important for operating room efficiency. Current predictive techniques in spine surgery include less sophisticated approaches such as classical multivariable statistical models. Machine learning approaches have been used to predict outcomes such as length of stay and time to return to normal work, but have not focused on case duration.
    OBJECTIVE: The primary objective of this 4-year, single-academic-center, retrospective study was to use an ensemble learning approach that may improve the accuracy of scheduled case duration for spine surgery. The primary outcome measure was case duration.
    METHODS: We compared machine learning models using surgical and patient features to our institutional method, which used historic averages and surgeon adjustments as needed. We implemented multivariable linear regression, random forest, bagging, and XGBoost (Extreme Gradient Boosting) and calculated the average R², root-mean-square error (RMSE), explained variance, and mean absolute error (MAE) using k-fold cross-validation. We then used the SHAP (Shapley Additive Explanations) explainer model to determine feature importance.
    RESULTS: A total of 3189 patients who underwent spine surgery were included. The institution's current method of predicting case times had a very poor coefficient of determination with actual times (R² = 0.213). On k-fold cross-validation, the linear regression model had an explained variance score of 0.345, an R² of 0.34, an RMSE of 162.84 minutes, and an MAE of 127.22 minutes. Among all models, the XGBoost regressor performed the best, with an explained variance score of 0.778, an R² of 0.770, an RMSE of 92.95 minutes, and an MAE of 44.31 minutes. Based on SHAP analysis of the XGBoost regression, body mass index, spinal fusions, surgical procedure, and number of spine levels involved were the features with the most impact on the model.
    CONCLUSIONS: Using ensemble learning-based predictive models, specifically XGBoost regression, can improve the accuracy of the estimation of spine surgery times.
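
The four metrics named in the methods can be illustrated for the linear regression baseline with a minimal sketch. It is not the study's pipeline: the data are synthetic stand-ins, the function name `kfold_linear_regression` and the choice of five folds are assumptions, and the tree-based models and SHAP analysis are not reproduced.

```python
import numpy as np

def kfold_linear_regression(X, y, k=5, seed=0):
    """Average R2, RMSE, explained variance and MAE of ordinary least squares over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = {"r2": [], "rmse": [], "explained_variance": [], "mae": []}
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column and fit by least squares on the training folds only.
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        resid = y[test] - A_test @ coef
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores["r2"].append(1.0 - np.sum(resid ** 2) / ss_tot)
        scores["rmse"].append(np.sqrt(np.mean(resid ** 2)))
        scores["explained_variance"].append(1.0 - np.var(resid) / np.var(y[test]))
        scores["mae"].append(np.mean(np.abs(resid)))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}

# Synthetic stand-in for "features -> case duration in minutes" (assumption, not study data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 120 + X @ np.array([30.0, -15.0, 8.0]) + rng.normal(scale=20.0, size=200)
metrics = kfold_linear_regression(X, y)
print(metrics)
```

R² and explained variance differ only when the held-out residuals have a non-zero mean, which is why the abstract's two values for a given model are close but not identical.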






  • Article type: Journal Article
    Many countries are suffering from the COVID-19 pandemic. The numbers of confirmed cases, recoveries, and deaths are of concern to countries with a high number of infected patients. Forecasting these quantities is crucial for controlling the spread of the disease and combating the pandemic. This study aimed to forecast the number of cases and deaths in the Kingdom of Saudi Arabia (KSA) using time-series and well-known statistical forecasting techniques, including exponential smoothing and linear regression. The study is extended to forecast the number of cases in other major countries such as the US, Spain, and Brazil (which have large numbers of infections) to validate the proposed models (Drift, SES, Holt, and ETS). The forecast results were validated using four evaluation measures. The results showed that the proposed ETS model is efficient at forecasting the number of cases and the Drift model at forecasting the number of deaths. The comparison study, using the number of cases in KSA, showed that ETS (with RMSE reaching 18.44) outperforms state-of-the-art studies (with RMSE equal to 107.54). The proposed forecasting model can be used as a benchmark to tackle this pandemic in any country.
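
Two of the simpler models named above, Drift and simple exponential smoothing (SES), can be sketched in a few lines, together with the RMSE used to compare forecasts. This is only an illustration under stated assumptions: the counts are made up, the smoothing parameter `alpha=0.5` and the initialisation of the level are arbitrary choices, and Holt and ETS are not shown.

```python
import math

def drift_forecast(series, horizon):
    """Drift method: extend the straight line joining the first and last observations."""
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + h * slope for h in range(1, horizon + 1)]

def ses_forecast(series, horizon, alpha=0.5):
    """Simple exponential smoothing: a flat forecast at the last smoothed level."""
    level = series[0]  # initialise the level with the first observation (one common choice)
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * horizon

def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Made-up cumulative counts for illustration; not data from the study.
train, test = [100, 120, 150, 170, 200], [225, 250]
print("drift RMSE:", rmse(test, drift_forecast(train, 2)))
print("SES RMSE:", rmse(test, ses_forecast(train, 2)))
```

Because SES produces a flat forecast, it tends to lag behind a steadily rising series, whereas the drift method carries the average historical increase forward.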






  • Article type: Journal Article
    Linear regression (LR) is a core model in supervised machine learning for regression tasks. One can fit this model using either an analytic (closed-form) formula or an iterative algorithm. Fitting it via the analytic formula becomes a problem when the number of predictors is greater than the number of samples, because the closed-form solution contains a matrix inverse that is not defined in that case. The standard approach to this issue is to use the Moore-Penrose inverse or L2 regularization. We propose another solution, starting from a machine learning model that is instead used in unsupervised learning for dimensionality reduction or density estimation: factor analysis (FA) with a one-dimensional latent space. The density estimation task is our focus since, in this case, the model can fit a Gaussian distribution even if the dimensionality of the data is greater than the number of samples; hence, we obtain this advantage when creating the supervised counterpart of factor analysis, which is linked to linear regression. We also create its semisupervised counterpart and then extend it to be usable with missing data. We prove an equivalence to linear regression and create experiments for each extension of the factor analysis model. The resulting algorithms are either a closed-form solution or an expectation-maximization (EM) algorithm. The latter is linked to information theory by optimizing a function containing a Kullback-Leibler (KL) divergence or the entropy of a random variable.
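
The problem described above, and the two standard remedies the abstract mentions, can be seen on a small example. This sketch uses random data with 3 samples and 5 predictors (sizes, seed, and the penalty `lam = 1.0` are arbitrary assumptions); it does not implement the factor-analysis approach the authors propose.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_predictors = 3, 5           # more predictors than samples
X = rng.normal(size=(n_samples, n_predictors))
y = rng.normal(size=n_samples)

# The normal equations need (X^T X)^{-1}, but X^T X is 5 x 5 with rank at most 3.
gram = X.T @ X
print("rank of X^T X:", np.linalg.matrix_rank(gram), "of", n_predictors)

# Remedy 1: Moore-Penrose inverse, giving the minimum-norm weights among all exact fits.
w_pinv = np.linalg.pinv(X) @ y

# Remedy 2: L2 regularization, where adding lam * I makes the matrix invertible.
lam = 1.0
w_ridge = np.linalg.solve(gram + lam * np.eye(n_predictors), X.T @ y)

print("pinv fit reproduces y:", np.allclose(X @ w_pinv, y))
print("norms (pinv, ridge):", np.linalg.norm(w_pinv), np.linalg.norm(w_ridge))
```

The pseudoinverse solution interpolates the training targets, while the ridge solution trades some training error for smaller weights.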







  • Article type: Journal Article
    This study examined the variability and trend of rainfall over Horro Guduru Wollega Zone. Studies such as this are of paramount importance in countries and areas where rain-fed agriculture is predominant. Datasets for analysis were obtained from the National Meteorological Agency of Ethiopia (NMA) for 1987 to 2016 and from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) portal for 1987-2019. Monthly rainfall with temporal variability ranging from 9.77 to 141.93% was observed. Highly variable (CV > 30%) and less variable (CV < 20%) rainfall was observed in the CHIRPS data records. Rainfall during most months of the last 30 and 33 years showed a decreasing trend. Rainfall with temporal variability ranging from 12.7 to 75.92% and from 8.11 to 43.45% was observed during the 3-month seasons, respectively. Rainfall over the 3-month seasons of the last 30 and 33 years showed a decreasing trend. Average total rainfall ranging from 107.203 to 1016.82 mm and from 122.8 to 1147.9 mm, with variability from 9.163 to 55.7% and from 7.831 to 36.68%, was observed during the Belg, Kiremt, and Bega seasons of the last 30 and 33 years, respectively. A decrease in rainfall was tested over these three seasons of the last 30 and 33 years. Significantly different (P < 0.05) and less variable (CV < 20%) annual total rainfall was recorded at 24 stations over 30 years. Declining annual rainfall was observed over 30 and 33 years. Non-significantly different (P < 0.05) and less variable (CV < 20%) average decadal rainfall ranging from 1342.6 to 1372.8 mm was observed during the last 33 years. The study area experienced decreasing rainfall trends over almost all time scales. This may have resulted in failures of agricultural production, which necessitates developing and implementing systematic planning and management of the crop calendar in the face of changing rainfall patterns.
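
The coefficient of variation (CV) thresholds quoted above, and a simple linear trend, can be computed as in the sketch below. Several points are assumptions: the rainfall values are invented, the label for the 20-30% band is not given in the abstract, and a least-squares slope is used as a stand-in because the abstract does not name the trend test that was applied.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * stdev(values) / mean(values)

def variability_class(cv):
    """Thresholds from the abstract: CV < 20% is less variable, CV > 30% is highly variable."""
    if cv < 20:
        return "less variable"
    if cv > 30:
        return "highly variable"
    return "moderately variable"  # assumed label for the band in between

def trend_slope(years, values):
    """Least-squares slope of values against years (units of rainfall per year)."""
    y_bar, v_bar = mean(years), mean(values)
    num = sum((y - y_bar) * (v - v_bar) for y, v in zip(years, values))
    den = sum((y - y_bar) ** 2 for y in years)
    return num / den

# Invented annual totals in mm, for illustration only.
years = [2010, 2011, 2012, 2013, 2014]
rainfall = [1400, 1380, 1390, 1350, 1340]
cv = coefficient_of_variation(rainfall)
print(variability_class(cv), trend_slope(years, rainfall))
```

A negative slope corresponds to the decreasing trend reported in the abstract; whether it is statistically significant requires a separate test.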






  • Article type: Journal Article
    We present an elective surgery redesign project involving several New Zealand hospitals that is primarily data-driven. One of the project objectives is to improve the predictions of surgery durations. We address this task by considering two approaches: (a) linear regression modelling, and (b) improvement of the data quality. For (a) we evaluate the accuracy of predictions using two performance measures. These predictions are compared to the surgeons' estimates that may subsequently be adjusted. We demonstrate using the historical surgical lists that the estimates from our prediction techniques improve the scheduling of elective surgeries by minimising the occurrences of list under- and over-runs. For (b), we discuss how the surgical data motivates a review of the surgery procedure classification which takes into account the design of the electronic booking form. The proposed hierarchical classification streamlines the specification of surgery types and therefore retains the potential for improved predictions.






  • Article type: Journal Article
    The outbreak of COVID-19 created a disastrous situation in more than 200 countries around the world. Predicting the future trend of the disease in different countries can therefore be useful for managing the outbreak. Several data-driven studies have addressed the prediction of COVID-19 cases, using features of past data to predict future values. In this study, a machine learning (ML)-guided linear regression model is used to address different COVID-19-related issues. The linear regression model was fitted to the dataset to model the total number of positive cases and the number of recoveries for different states in India, such as Maharashtra, West Bengal, Kerala, Delhi and Assam. From the current analysis of COVID-19 data, it has been observed that the daily number of infections first grows linearly and then increases exponentially. This property has been incorporated into our prediction, and piecewise linear regression is the model best suited to capture it. The experimental results show the superiority of the proposed scheme and, to the best of our knowledge, this is a new approach to the prediction of COVID-19.
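
Piecewise linear regression of the kind mentioned above can be sketched as a continuous two-segment fit whose breakpoint is chosen by grid search. This is one common formulation and not necessarily the authors' scheme; the function name `fit_piecewise_linear`, the hinge parameterisation, and the synthetic counts are all assumptions.

```python
import numpy as np

def fit_piecewise_linear(x, y):
    """Continuous two-segment fit y = a + b*x + c*max(0, x - k), with k chosen by grid search."""
    best = None
    for k in x[2:-2]:  # keep a few points on each side of the candidate breakpoint
        design = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - k)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((y - design @ coef) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(k), coef)
    return best  # (sum of squared errors, breakpoint, [a, b, c])

# Synthetic daily counts: slope 1 up to day 10, slope 5 afterwards (illustrative only).
days = np.arange(20, dtype=float)
cases = np.where(days < 10, days, 10 + 5 * (days - 10))
sse, breakpoint_day, (a, b, c) = fit_piecewise_linear(days, cases)
print(breakpoint_day, b, b + c)  # breakpoint, slope before it, slope after it
```

The coefficient `c` is the change in slope at the breakpoint, so the second segment's slope is `b + c`. Adding more hinge terms gives more segments, which is how a piecewise linear model can follow growth that accelerates over time.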







  • Article type: Journal Article
    The reliable reconstruction of the temperature conditions at a crime scene is still a great challenge in forensic-entomological case work. Despite many published standards and guidelines for reconstructing temperature, and studies analysing the influence of various factors on the accuracy of such reconstructions, there are astonishingly many cases in the literature in which the temperature at the place of discovery is not reconstructed at all, i.e. the most common method is using the data of the nearest meteorological weather station without any correlation with on-site data. This study summarizes the state of the art in temperature reconstruction from an entomological point of view and compares the application of generalized additive models (GAMs) and linear regression on the basis of hypothetical death scenarios with various post mortem intervals (PMI) and measurement periods for the correlation between crime scene and weather station. We show that GAMs, which analyse the potential delay effect of temperature within a day, are the tools of choice because they give better, i.e. more accurate, estimations than linear regression in 95.6% of all analysed cases regardless of the PMI, body discovery site and correlation period. Nevertheless, each case and crime scene is unique, and therefore each entomological expert report should discuss the possible strengths and weaknesses of its temperature reconstruction. Even if temperature is not or cannot be reconstructed for various reasons, a comparison of on-site data with those of a meteorological weather station is the minimum forensic experts should do.
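
The linear regression baseline that the study compares against GAMs can be sketched as follows: temperatures logged at the scene during a measurement period are regressed on simultaneous weather-station readings, and the fitted line is then applied to station data from the period of interest. All temperatures below are invented, the function names are hypothetical, and the GAM approach the study favours is not shown.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def reconstruct(station_temps, intercept, slope):
    """Estimate scene temperatures from station temperatures using the fitted line."""
    return [intercept + slope * t in (None,) or intercept + slope * t for t in station_temps]

# Invented paired readings (degrees Celsius) from the correlation period.
station_measured = [10.0, 12.0, 15.0, 18.0, 14.0]
scene_measured = [12.0, 13.0, 14.5, 16.0, 14.0]
intercept, slope = fit_line(station_measured, scene_measured)

# Station readings from the earlier period for which the scene temperature is unknown.
station_past = [8.0, 20.0]
print(reconstruct(station_past, intercept, slope))
```

A single straight line cannot represent a scene that warms and cools with a delay relative to the station, which is the within-day effect the abstract says GAMs capture.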






  • Article type: Journal Article
    Intersectional MAIHDA involves applying multilevel models in order to estimate intercategorical inequalities. The approach has been validated thus far using both simulations and empirical applications, and has numerous methodological and theoretical advantages over single-level approaches, including parsimony and reliability for analyzing high-dimensional interactions. In this issue of SSM, Lizotte, Mahendran, Churchill and Bauer (hereafter "LMCB") assert that there has been insufficient clarity on the interpretation of fixed effects regression coefficients in intersectional MAIHDA, and that stratum-level residuals in intersectional MAIHDA are not interpretable as interaction effects. We disagree with their second assertion; however, the authors are right to call for greater clarity. For this purpose, in this response we have three main objectives. (1) In their commentary, LMCB incorrectly describe model predictions based on MAIHDA fixed effects as estimates of "grand means" (or the mean of means), when they are actually "precision-weighted grand means." We clarify the differences between average predicted values obtained by different models, and argue that predictions obtained by MAIHDA are more suitable to serve as reference points for residual/interaction effects. This further enables us to clarify the interpretation of residual/interaction effects in MAIHDA and conventional models. Using simple simulations, we demonstrate conditions under which the precision-weighted grand mean resembles a grand mean, and when it resembles a population mean (or the mean of all individual observations) obtained using single-level regression, explaining the results obtained by LMCB and informing future research. (2) We construct a modification to MAIHDA that constrains the fixed effects so that the resulting model predictions provide estimates of population means, which we use to demonstrate the robustness of results reported by Evans et al. (2018). We find that stratum-specific residuals obtained using the two approaches are highly correlated (Pearson corr = 0.98, p < 0.0001) and no substantive conclusions would have been affected if the preference had been for estimating population means. However, we advise researchers to use the original, unconstrained MAIHDA. (3) Finally, we outline the extent to which single-level and MAIHDA approaches address the fundamental goals of quantitative intersectional analyses and conclude that intersectional MAIHDA remains a promising new approach for the examination of inequalities.
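
The distinction between a grand mean, a population mean, and a precision-weighted grand mean can be made concrete with a small numerical sketch. It assumes the textbook random-intercept model with no covariates and known variance components, where each stratum mean is weighted by the inverse of `tau2 + sigma2 / n_j`; the stratum means, sizes, and variance values are invented, and real MAIHDA models estimate these quantities rather than taking them as given.

```python
def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def precision_weighted_grand_mean(stratum_means, stratum_sizes, tau2, sigma2):
    """Weight each stratum mean by 1 / (tau2 + sigma2 / n_j).

    tau2 is the between-stratum variance and sigma2 the within-stratum variance
    (both treated as known here, which is a simplifying assumption).
    """
    weights = [1.0 / (tau2 + sigma2 / n) for n in stratum_sizes]
    return weighted_mean(stratum_means, weights)

# Invented strata with very unequal sizes.
means = [10.0, 20.0, 30.0]
sizes = [1000, 100, 10]

grand_mean = sum(means) / len(means)            # mean of stratum means
population_mean = weighted_mean(means, sizes)   # mean of all individual observations

for tau2 in (0.0, 0.01, 1.0, 1e9):
    print(tau2, precision_weighted_grand_mean(means, sizes, tau2, sigma2=1.0))
```

With no between-stratum variance the weights are proportional to stratum size and the result equals the population mean; as between-stratum variance dominates, the weights equalise and the result approaches the grand mean.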






  • Article type: Journal Article
    The strength of an acid can be determined from its pKa value. Attempts have been made to find a relationship between pKa and the activation energy barrier for a double proton transfer (DPT) reaction in inorganic acid dimers. A negative influence of pKa on the activation energy (Ea) is observed, which is contrary to the general convention for pKa. Four different levels of theory with two different basis sets have been used to calculate the activation energy barrier of the DPT reaction in inorganic acid dimers. A model based on first- and second-order polynomials has been created to describe the relationship between pKa and the activation energy for the DPT reaction. © 2018 Wiley Periodicals, Inc.
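
Fitting first- and second-order polynomials of the kind mentioned above can be sketched with `numpy.polyfit`. The numbers below are placeholders generated from an arbitrary quadratic and are not values from the paper; only the fitting procedure is illustrated.

```python
import numpy as np

# Placeholder values for illustration only; NOT data from the paper.
pka = np.array([-3.0, -1.0, 0.0, 2.0, 4.0, 7.0])
ea = 12.0 - 0.8 * pka + 0.05 * pka ** 2   # synthetic "barrier" from an arbitrary quadratic

linear = np.polyfit(pka, ea, deg=1)       # coefficients [slope, intercept]
quadratic = np.polyfit(pka, ea, deg=2)    # coefficients [a2, a1, a0]

def sse(coeffs):
    """Sum of squared errors of a fitted polynomial on the placeholder data."""
    return float(np.sum((np.polyval(coeffs, pka) - ea) ** 2))

print("linear slope:", linear[0], "SSE:", sse(linear))
print("quadratic coefficients:", quadratic, "SSE:", sse(quadratic))
```

Comparing the two residual sums of squares shows how much the second-order term improves the fit; on real data an adjusted criterion would be needed because the quadratic model always fits at least as well.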






  • Article type: Comparative Study
    Case-control association studies often collect from their subjects information on secondary phenotypes. Reusing the data and studying the association between genes and secondary phenotypes provide an attractive and cost-effective approach that can lead to discovery of new genetic associations. A number of approaches have been proposed, including simple and computationally efficient ad hoc methods that ignore ascertainment or stratify on case-control status. Justification for these approaches relies on the assumption of no covariates and the correct specification of the primary disease model as a logistic model. Both might not be true in practice, for example, in the presence of population stratification or the primary disease model following a probit model. In this paper, we investigate the validity of ad hoc methods in the presence of covariates and possible disease model misspecification. We show that in taking an ad hoc approach, it may be desirable to include covariates that affect the primary disease in the secondary phenotype model, even though these covariates are not necessarily associated with the secondary phenotype. We also show that when the disease is rare, ad hoc methods can lead to severely biased estimation and inference if the true disease model follows a probit model instead of a logistic model. Our results are justified theoretically and via simulations. Applied to real data analysis of genetic associations with cigarette smoking, ad hoc methods collectively identified as highly significant (P < 10⁻⁵) single nucleotide polymorphisms from over 10 genes, genes that were identified in previous studies of smoking cessation.
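
The two ad hoc analyses named above, ignoring ascertainment and stratifying on case-control status, amount to fitting the same regression on different subsets of the sample. The sketch below uses simulated data under a deliberately simple assumption that case status is unrelated to both genotype and phenotype, so all estimates agree; it does not reproduce the biased scenarios the paper studies, and every name and parameter value is hypothetical.

```python
import numpy as np

def slope_on_genotype(phenotype, genotype):
    """Ordinary least squares coefficient of genotype in phenotype ~ 1 + genotype."""
    design = np.column_stack([np.ones_like(phenotype), genotype])
    coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return float(coef[1])

rng = np.random.default_rng(1)
n = 4000
genotype = rng.binomial(2, 0.3, size=n).astype(float)   # 0, 1 or 2 copies of an allele
case = rng.binomial(1, 0.5, size=n)                      # simplification: independent of everything
phenotype = 0.5 * genotype + rng.normal(size=n)          # true effect 0.5 (assumed)

pooled = slope_on_genotype(phenotype, genotype)                          # ignore ascertainment
controls = slope_on_genotype(phenotype[case == 0], genotype[case == 0])  # controls only
cases = slope_on_genotype(phenotype[case == 1], genotype[case == 1])     # cases only
print(pooled, controls, cases)
```

When case status depends on genotype or on the secondary phenotype, these subsets are no longer random samples of the population, which is the situation in which the abstract reports that ad hoc estimates can be biased.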





