linear regression

  • Article type: Journal Article
    BACKGROUND: Estimating surgical case duration accurately is an important operating room efficiency metric. Current predictive techniques in spine surgery include less sophisticated approaches such as classical multivariable statistical models. Machine learning approaches have been used to predict outcomes such as length of stay and time returning to normal work, but have not been focused on case duration.
    OBJECTIVE: The primary objective of this 4-year, single-academic-center, retrospective study was to use an ensemble learning approach that may improve the accuracy of scheduled case duration for spine surgery. The primary outcome measure was case duration.
    METHODS: We compared machine learning models using surgical and patient features to our institutional method, which used historic averages and surgeon adjustments as needed. We implemented multivariable linear regression, random forest, bagging, and XGBoost (Extreme Gradient Boosting) and calculated the average R², root-mean-square error (RMSE), explained variance, and mean absolute error (MAE) using k-fold cross-validation. We then used the SHAP (Shapley Additive Explanations) explainer model to determine feature importance.
    RESULTS: A total of 3189 patients who underwent spine surgery were included. The institution's current method of predicting case times has a very poor coefficient of determination with actual times (R² = 0.213). On k-fold cross-validation, the linear regression model had an explained variance score of 0.345, an R² of 0.34, an RMSE of 162.84 minutes, and an MAE of 127.22 minutes. Among all models, the XGBoost regressor performed the best with an explained variance score of 0.778, an R² of 0.770, an RMSE of 92.95 minutes, and an MAE of 44.31 minutes. Based on SHAP analysis of the XGBoost regression, body mass index, spinal fusions, surgical procedure, and number of spine levels involved were the features with the most impact on the model.
    CONCLUSIONS: Using ensemble learning-based predictive models, specifically XGBoost regression, can improve the accuracy of the estimation of spine surgery times.
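The four scores reported in this abstract can be computed by hand, which may help when reading the results. The sketch below is not the authors' code: the function name `regression_metrics` and the case durations are hypothetical, and it assumes the usual definitions (explained variance is 1 minus the variance of the residuals over the variance of the target, so unlike R² it ignores a constant bias).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, explained variance, RMSE and MAE for paired observations."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_res = sum(residuals) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)       # total sum of squares
    ss_res = sum(r ** 2 for r in residuals)                  # residual sum of squares
    var_res = sum((r - mean_res) ** 2 for r in residuals)    # centred residual spread
    return {
        "r2": 1 - ss_res / ss_tot,
        "explained_variance": 1 - var_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Hypothetical case durations in minutes; every prediction is 10 minutes too long.
actual = [120.0, 240.0, 180.0, 300.0]
predicted = [130.0, 250.0, 190.0, 310.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

With a constant 10-minute overestimate, the explained variance stays at 1 while R² drops below 1, which is one reason the two scores can differ slightly, as they do in the abstract (0.345 versus 0.34).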

  • Article type: Journal Article
    Many countries are suffering from the COVID-19 pandemic. The numbers of confirmed cases, recoveries, and deaths are of concern to countries with a high number of infected patients. Forecasting these parameters is a crucial way to control the spread of the disease and combat the pandemic. This study aimed at forecasting the number of cases and deaths in KSA using time-series and well-known statistical forecasting techniques including Exponential Smoothing and Linear Regression. The study is extended to forecast the number of cases in the main affected countries, such as the US, Spain, and Brazil (which have a large number of infections), to validate the proposed models (Drift, SES, Holt, and ETS). The forecast results were validated using four evaluation measures. The results showed that the proposed ETS (resp. Drift) model is efficient for forecasting the number of cases (resp. deaths). The comparison study, using the number of cases in KSA, showed that ETS (with RMSE reaching 18.44) outperforms the state-of-the-art studies (with RMSE equal to 107.54). The proposed forecasting model can be used as a benchmark to tackle this pandemic in any country.
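Two of the models named in this abstract, Drift and SES (simple exponential smoothing), are small enough to sketch directly. This is an illustration under stated assumptions, not the study's implementation: the function names, the smoothing weight `alpha = 0.5`, the choice to initialise the SES level at the first observation, and the case counts are all hypothetical.

```python
def drift_forecast(series, horizon):
    """Drift method: extend the straight line through the first and last points."""
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + h * slope for h in range(1, horizon + 1)]


def ses_forecast(series, alpha, horizon):
    """Simple exponential smoothing: a flat forecast at the final smoothed level."""
    level = series[0]  # assumption: initialise the level at the first observation
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


# Hypothetical daily case counts.
cases = [10.0, 12.0, 15.0, 19.0, 22.0]
drift = drift_forecast(cases, horizon=3)
ses = ses_forecast(cases, alpha=0.5, horizon=3)
print(drift)  # [25.0, 28.0, 31.0]
print(ses)    # [19.0, 19.0, 19.0]
```

On a steadily rising series the drift forecast keeps climbing while the SES forecast stays flat, which illustrates why different models can suit different targets.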

  • Article type: Journal Article
    Linear regression (LR) is a core model in supervised machine learning performing a regression task. One can fit this model using either an analytic/closed-form formula or an iterative algorithm. Fitting it via the analytic formula becomes a problem when the number of predictors is greater than the number of samples because the closed-form solution contains a matrix inverse that is not defined when there are more predictors than samples. The standard approach to solving this issue is to use the Moore-Penrose inverse or L2 regularization. We propose another solution starting from a machine learning model that, this time, is used in unsupervised learning performing a dimensionality reduction task or just a density estimation one, namely factor analysis (FA) with a one-dimensional latent space. The density estimation task represents our focus since, in this case, it can fit a Gaussian distribution even if the dimensionality of the data is greater than the number of samples; hence, we obtain this advantage when creating the supervised counterpart of factor analysis, which is linked to linear regression. We also create its semisupervised counterpart and then extend it to be usable with missing data. We prove an equivalence to linear regression and create experiments for each extension of the factor analysis model. The resulting algorithms are either a closed-form solution or an expectation-maximization (EM) algorithm. The latter is linked to information theory by optimizing a function containing a Kullback-Leibler (KL) divergence or the entropy of a random variable.
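For reference, the closed-form solution and the two standard remedies mentioned in this abstract can be written out as follows. The notation is an assumption made for illustration (design matrix X with n samples and p predictors, response y, penalty λ); the abstract itself fixes no symbols.

```latex
% Ordinary least squares, X \in \mathbb{R}^{n \times p}, y \in \mathbb{R}^{n}:
\hat{\beta}_{\mathrm{OLS}} = (X^{\top} X)^{-1} X^{\top} y

% When p > n, \operatorname{rank}(X^{\top} X) \le n < p, so the p \times p
% matrix X^{\top} X is singular and the inverse above does not exist.

% L2 regularization (ridge), \lambda > 0, always invertible:
\hat{\beta}_{\mathrm{ridge}} = (X^{\top} X + \lambda I_{p})^{-1} X^{\top} y

% Moore-Penrose inverse (minimum-norm least-squares solution); the second
% equality holds when X has full row rank n:
\hat{\beta}_{\mathrm{MP}} = X^{+} y = X^{\top} (X X^{\top})^{-1} y
```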

  • Article type: Journal Article
    This study examined the variability and trend of rainfall over Horro Guduru Wollega Zone. Studies such as this are of paramount importance in countries and areas where rain-fed agriculture is predominant. Datasets for analysis were obtained from the National Meteorological Agency of Ethiopia (NMA) from 1987 to 2016 and Climate Hazard Group InfraRed Precipitation (CHIRPS) 1987-2019 with the station portal. Monthly rainfall with temporal variability ranging from 9.77 to 141.93% was observed. Highly variable (CV > 30%) and less variable (CV < 20%) rainfall was observed in the CHIRPS data records. Rainfall during most months of the last 30 and 33 years showed a decreasing trend. Rainfall with temporal variability ranging from 12.7 to 75.92% and from 8.11 to 43.45% was observed during the 3-month seasons, respectively. Rainfall over the 3-month seasons of the last 30 and 33 years showed a decreasing trend. Average total rainfall ranging from 107.203 to 1016.82 mm and from 122.8 to 1147.9 mm, with variability from 9.163 to 55.7% and from 7.831 to 36.68%, was observed during the Belg, Kiremt, and Bega seasons of the last 30 and 33 years, respectively. A decrease in rainfall was tested over these three seasons of the last 30 and 33 years. Significantly different (P < 0.05) and less variable (CV < 20%) annual total rainfall was recorded at 24 stations over 30 years. Declining annual rainfall was observed over 30 and 33 years. Non-significantly different (P < 0.05) and less variable (CV < 20%) average decadal rainfall ranging from 1342.6 to 1372.8 mm was observed during the last 33 years. The study area has experienced rainfall with decreasing trends over almost all time scales. These trends may result in the failure of agricultural production, which necessitates developing and implementing systematic planning and management activities in the crop calendar in the face of changing rainfall patterns.
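The variability figures in this abstract are coefficients of variation (CV), which can be sketched as below. The thresholds CV < 20% and CV > 30% come from the abstract; the label for the band in between, the use of the sample standard deviation, the function names, and the rainfall totals are assumptions made for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


def classify_variability(cv_percent):
    """Apply the abstract's thresholds; the middle label is an assumption."""
    if cv_percent < 20:
        return "less variable"
    if cv_percent > 30:
        return "highly variable"
    return "moderately variable"


# Hypothetical annual rainfall totals in mm.
annual_totals = [1300.0, 1350.0, 1400.0, 1250.0, 1450.0]
cv = coefficient_of_variation(annual_totals)
label = classify_variability(cv)
print(round(cv, 2), label)
```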

  • Article type: Journal Article
    We present an elective surgery redesign project involving several New Zealand hospitals that is primarily data-driven. One of the project objectives is to improve the predictions of surgery durations. We address this task by considering two approaches: (a) linear regression modelling, and (b) improvement of the data quality. For (a) we evaluate the accuracy of predictions using two performance measures. These predictions are compared to the surgeons' estimates that may subsequently be adjusted. We demonstrate using the historical surgical lists that the estimates from our prediction techniques improve the scheduling of elective surgeries by minimising the occurrences of list under- and over-runs. For (b), we discuss how the surgical data motivates a review of the surgery procedure classification which takes into account the design of the electronic booking form. The proposed hierarchical classification streamlines the specification of surgery types and therefore retains the potential for improved predictions.

  • Article type: Journal Article
    The outbreak of COVID-19 created a disastrous situation in more than 200 countries around the world. Predicting the future trend of the disease in different countries can therefore be useful for managing the outbreak. Several data-driven studies have predicted COVID-19 cases using features of past data. In this study, a machine learning (ML)-guided linear regression model has been used to address different types of COVID-19-related issues. The linear regression model has been fitted to the dataset to model the total number of positive cases and the number of recoveries for different states in India, such as Maharashtra, West Bengal, Kerala, Delhi, and Assam. From the current analysis of COVID-19 data, it has been observed that the daily number of infections first grows linearly and then increases exponentially. This property has been incorporated into our prediction, and piecewise linear regression is the model best suited to capture it. The experimental results show the superiority of the proposed scheme, and to the best of our knowledge this is a new approach to the prediction of COVID-19.
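One simple way to realise a piecewise linear regression is to try every candidate breakpoint, fit an ordinary least-squares line on each side, and keep the split with the smallest total squared error. The sketch below illustrates that idea only; it is not the scheme proposed in the study, the two segments are not forced to join at the breakpoint, and the function names and daily counts are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns slope, intercept, SSE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def fit_two_segments(xs, ys, min_points=2):
    """Search all split indices and return (total SSE, split, left fit, right fit)."""
    best = None
    for k in range(min_points, len(xs) - min_points + 1):
        left = fit_line(xs[:k], ys[:k])
        right = fit_line(xs[k:], ys[k:])
        total = left[2] + right[2]
        if best is None or total < best[0]:
            best = (total, k, left[:2], right[:2])
    return best


# Hypothetical daily counts: slow growth for 5 days, then much faster growth.
days = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
counts = [0.0, 2.0, 4.0, 6.0, 8.0, 20.0, 30.0, 40.0, 50.0, 60.0]
total_sse, split, left_fit, right_fit = fit_two_segments(days, counts)
print(split, left_fit, right_fit)
```

Here `split` is the index of the first observation in the second segment, and each fit is a `(slope, intercept)` pair.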

  • Article type: Journal Article
    The reliable reconstruction of the temperature conditions at a crime scene is still a great challenge in forensic-entomological case work. Despite many published standards and guidelines for reconstructing temperature, and studies analysing the influence of various factors on the accuracy of such reconstructions, there are astonishingly many cases in the literature in which the temperature at the place of discovery is not reconstructed at all, i.e. the most common method is using the data of the nearest meteorological weather station without any correlation with on-site data. This study summarizes the state of the art in temperature reconstruction from an entomological point of view and compares the application of generalized additive models (GAMs) and linear regression on the basis of hypothetical death scenarios with various post mortem intervals (PMI) and measurement periods for the correlation between crime scene and weather station. We show that GAMs, i.e. models analysing the potential delay effect of temperature within a day, are the tools of choice because they give better, i.e. more accurate, estimations than linear regression in 95.6% of all analysed cases regardless of the PMI, body discovery site, and correlation period. Nevertheless, each case and crime scene is unique, and therefore each entomological expert report should discuss the possible strengths and weaknesses of its temperature reconstruction. Even if temperature is not or cannot be reconstructed for various reasons, a comparison of on-site data with those of a meteorological weather station is the minimum forensic experts should do.

  • Article type: Journal Article
    Intersectional MAIHDA involves applying multilevel models in order to estimate intercategorical inequalities. The approach has been validated thus far using both simulations and empirical applications, and has numerous methodological and theoretical advantages over single-level approaches, including parsimony and reliability for analyzing high-dimensional interactions. In this issue of SSM, Lizotte, Mahendran, Churchill and Bauer (hereafter "LMCB") assert that there has been insufficient clarity on the interpretation of fixed effects regression coefficients in intersectional MAIHDA, and that stratum-level residuals in intersectional MAIHDA are not interpretable as interaction effects. We disagree with their second assertion; however, the authors are right to call for greater clarity. For this purpose, in this response we have three main objectives. (1) In their commentary, LMCB incorrectly describe model predictions based on MAIHDA fixed effects as estimates of "grand means" (or the mean of means), when they are actually "precision-weighted grand means." We clarify the differences between average predicted values obtained by different models, and argue that predictions obtained by MAIHDA are more suitable to serve as reference points for residual/interaction effects. This further enables us to clarify the interpretation of residual/interaction effects in MAIHDA and conventional models. Using simple simulations, we demonstrate conditions under which the precision-weighted grand mean resembles a grand mean, and when it resembles a population mean (or the mean of all individual observations) obtained using single-level regression, explaining the results obtained by LMCB and informing future research. (2) We construct a modification to MAIHDA that constrains the fixed effects so that the resulting model predictions provide estimates of population means, which we use to demonstrate the robustness of results reported by Evans et al. (2018). We find that stratum-specific residuals obtained using the two approaches are highly correlated (Pearson corr = 0.98, p < 0.0001) and no substantive conclusions would have been affected if the preference had been for estimating population means. However, we advise researchers to use the original, unconstrained MAIHDA. (3) Finally, we outline the extent to which single-level and MAIHDA approaches address the fundamental goals of quantitative intersectional analyses and conclude that intersectional MAIHDA remains a promising new approach for the examination of inequalities.

  • Article type: Journal Article
    The strength of an acid can be determined by means of its pKa value. Attempts have been made to find a relationship between pKa and the activation energy barrier for a double proton transfer (DPT) reaction in inorganic acid dimers. A negative influence of pKa on the activation energy (Ea) is observed, which is contrary to the general convention of pKa. Four different levels of theory with two different basis sets have been used to calculate the activation energy barrier of the DPT reaction in inorganic acid dimers. A model based on first- and second-order polynomials has been created to find the relationship between pKa and the activation energy for the DPT reaction. © 2018 Wiley Periodicals, Inc.

  • Article type: Comparative Study
    Case-control association studies often collect from their subjects information on secondary phenotypes. Reusing the data and studying the association between genes and secondary phenotypes provide an attractive and cost-effective approach that can lead to discovery of new genetic associations. A number of approaches have been proposed, including simple and computationally efficient ad hoc methods that ignore ascertainment or stratify on case-control status. Justification for these approaches relies on the assumption of no covariates and the correct specification of the primary disease model as a logistic model. Both might not be true in practice, for example, in the presence of population stratification or the primary disease model following a probit model. In this paper, we investigate the validity of ad hoc methods in the presence of covariates and possible disease model misspecification. We show that in taking an ad hoc approach, it may be desirable to include covariates that affect the primary disease in the secondary phenotype model, even though these covariates are not necessarily associated with the secondary phenotype. We also show that when the disease is rare, ad hoc methods can lead to severely biased estimation and inference if the true disease model follows a probit model instead of a logistic model. Our results are justified theoretically and via simulations. Applied to real data analysis of genetic associations with cigarette smoking, ad hoc methods collectively identified as highly significant (P < 10⁻⁵) single nucleotide polymorphisms from over 10 genes, genes that were identified in previous studies of smoking cessation.