penalized logistic regression

  • Article type: Journal Article
    BACKGROUND: Health care-associated infections due to multidrug-resistant organisms (MDROs), such as methicillin-resistant Staphylococcus aureus (MRSA) and Clostridioides difficile (CDI), place a significant burden on our health care infrastructure.
    OBJECTIVE: Screening for MDROs is an important mechanism for preventing spread but is resource intensive. The objective of this study was to develop automated tools that can predict colonization or infection risk using electronic health record (EHR) data, provide useful information to aid infection control, and guide empiric antibiotic coverage.
    METHODS: We retrospectively developed a machine learning model to detect MRSA colonization and infection in undifferentiated patients at the time of sample collection from hospitalized patients at the University of Virginia Hospital. We used clinical and nonclinical features derived from on-admission and throughout-stay information from the patient's EHR data to build the model. In addition, we used a class of features derived from contact networks in EHR data; these network features can capture patients' contacts with providers and other patients, improving model interpretability and accuracy for predicting the outcome of surveillance tests for MRSA. Finally, we explored whether heterogeneous models for different patient subpopulations (for example, those admitted to an intensive care unit or emergency department, or those with specific testing histories) perform better.
    RESULTS: We found that penalized logistic regression performs better than other methods, and this model's performance, measured by the area under the receiver operating characteristic curve (ROC-AUC), improves by nearly 11% when we use a polynomial (second-degree) transformation of the features. Some significant features in predicting MDRO risk include antibiotic use, surgery, use of devices, dialysis, the patient's comorbidity conditions, and network features. Among these, network features add the most value and improve the model's performance by at least 15%. The penalized logistic regression model with the same transformation of features also performs better than other models for specific patient subpopulations.
    CONCLUSIONS: Our study shows that MRSA risk prediction can be conducted quite effectively by machine learning methods using clinical and nonclinical features derived from EHR data. Network features are the most predictive and provide significant improvement over prior methods. Furthermore, heterogeneous prediction models for different patient subpopulations enhance the model's performance.
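    The combination described in the RESULTS above (a penalized logistic regression fitted on second-degree polynomial features) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the ridge (L2) penalty, the gradient-descent fit, the synthetic data, and all function names are assumptions made for the example, and accuracy stands in for ROC-AUC to keep the code short.

```python
import math
import random

def quadratic_features(x):
    """Append all squares and pairwise products to the original features."""
    out = list(x)
    for i in range(len(x)):
        for j in range(i, len(x)):
            out.append(x[i] * x[j])
    return out

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_ridge_logistic(X, y, lam=0.01, lr=0.5, n_iter=500):
    """Minimise mean log-loss + (lam/2)*||w||^2 by gradient descent.
    The intercept b is left unpenalised."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        gw, gb = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            r = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += r
            for j in range(p):
                gw[j] += r * xi[j]
        b -= lr * gb / n
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, gw)]
    return w, b

def accuracy(X, y, w, b):
    hits = 0
    for xi, yi in zip(X, y):
        pred = 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= 0.5 else 0
        hits += (pred == yi)
    return hits / len(y)

# Hypothetical data: the label depends only on an interaction (x1 * x2 > 0),
# which a model that is linear in the raw features cannot represent.
random.seed(0)
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
y = [1 if x1 * x2 > 0 else 0 for x1, x2 in X]

w_lin, b_lin = fit_ridge_logistic(X, y)
Xq = [quadratic_features(x) for x in X]
w_quad, b_quad = fit_ridge_logistic(Xq, y)

acc_lin = accuracy(X, y, w_lin, b_lin)
acc_quad = accuracy(Xq, y, w_quad, b_quad)
print(f"raw features: {acc_lin:.2f}  quadratic features: {acc_quad:.2f}")
```

    On data like this the quadratic expansion helps because the interaction term becomes an ordinary column; whether a similar gain appears on real EHR data depends on the features and is not something this sketch can show.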

  • The National Mental Health Survey of India (NMHS) was a ground-breaking nationwide study that harnessed a uniform, standardized methodology blending quantitative and qualitative approaches. Covering data from 12 states across diverse regions, its mission was to gauge the prevalence of psychiatric disorders, bridge treatment gaps, explore service utilization, and assess the socioeconomic repercussions of these conditions. This initiative provided pivotal insights into the intricate landscape of mental health in India. One of the analyses planned for NMHS data was a logistic regression analysis aimed at unraveling how various sociodemographic factors influence the presence or absence of specific psychiatric disorders. Within this pursuit, two substantial challenges loomed. The first pertained to data separation, a complication that could perturb parameter estimation. The second challenge stemmed from the existence of disorders with lower prevalence rates, which resulted in datasets of limited density, potentially undermining the statistical reliability of our analysis. In response to these data-driven hurdles, NMHS recognized the critical necessity for an alternative to conventional logistic regression, one that could adeptly navigate these complexities, ensuring robust and dependable insights from the collected data. Traditional logistic regression, a widely prevalent method for modeling binary outcomes, has its limitations, especially when faced with limited datasets and rare outcomes. Here, the problem of "complete separation" can lead to convergence failure in traditional logistic regression estimations, a conundrum frequently encountered when handling binary variables. Firth's penalized logistic regression emerges as a potent solution to these challenges, effectively mitigating analytical biases rooted in small sample sizes, rare events, and complete separation. This article endeavors to illuminate the superior efficacy of Firth's method in managing small datasets within scientific research and advocates for its more widespread application. We provide a succinct introduction to Firth's method, emphasizing its distinct advantages over alternative analytical approaches and underscoring its application to data from the NMHS 2015-2016, particularly for disorders with lower prevalence.
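    The separation problem and Firth's remedy described above can be illustrated on a toy example. The sketch below is an assumption-laden illustration, not the NMHS analysis: it handles only an intercept plus one predictor, uses a hand-written 2x2 matrix inverse, omits the step-halving safeguards that production implementations typically include, and runs on six made-up observations. It uses the standard Firth-modified score, sum_i (y_i - p_i + h_i (1/2 - p_i)) x_i, where h_i are the hat-matrix diagonals.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_likelihood(b0, b1, x, y):
    """Ordinary (unpenalised) Bernoulli log-likelihood."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = b0 + b1 * xi
        z = z if yi == 1 else -z
        total += -math.log1p(math.exp(-z))
    return total

def firth_logistic(x, y, n_iter=50, tol=1e-8):
    """Firth-penalised fit of logit P(y=1) = b0 + b1*x."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information X^T W X for design columns (1, x): [[a, b], [b, d]]
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        d = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * d - b * b
        # Hat-matrix diagonals h_i = w_i * (1, x_i) I^{-1} (1, x_i)^T
        h = [wi * (d - 2.0 * b * xi + a * xi * xi) / det for wi, xi in zip(w, x)]
        # Firth-modified score residuals
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))
        # Scoring step: beta <- beta + I^{-1} U*
        step0 = (d * u0 - b * u1) / det
        step1 = (-b * u0 + a * u1) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1

# Made-up, completely separated data: every x < 0 has y = 0, every x > 0 has y = 1.
x = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
y = [0, 0, 0, 1, 1, 1]

# The ordinary likelihood keeps increasing as the slope grows, so no finite
# maximum likelihood estimate exists.
ml_trace = [log_likelihood(0.0, s, x, y) for s in (1.0, 5.0, 10.0)]

# The Firth-penalised fit converges to finite coefficients.
b0_firth, b1_firth = firth_logistic(x, y)
print(ml_trace, b0_firth, b1_firth)
```

    The design choice worth noting is that the penalty needs no tuning parameter, which is part of why it is attractive for small or rare-event datasets such as those the abstract describes.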

  • Article type: Journal Article
    Cancer classification and gene selection are important applications in DNA microarray gene expression data analysis. Since DNA microarray data suffers from the high-dimensionality problem, automatic gene selection methods are used to enhance the classification performance of expert classifier systems. In this paper, a new penalized logistic regression method that performs simultaneous gene coefficient estimation and variable selection in DNA microarray data is discussed. The method employs prior information about the gene coefficients to improve the classification accuracy of the underlying model. The coordinate descent algorithm with screening rules is given to obtain the gene coefficient estimates of the proposed method efficiently. The performance of the method is examined on five high-dimensional cancer classification datasets using the area under the curve, the number of selected genes, misclassification rate and F-score measures. The real data analysis results indicate that the proposed method achieves a good cancer classification performance with a small misclassification rate, large area under the curve and F-score by trading off some sparsity level of the underlying model. Hence, the proposed method can be seen as a reliable penalized logistic regression method in the scope of high-dimensional cancer classification.

  • Article type: Journal Article
    For finite samples with binary outcomes, penalized logistic regression, such as ridge logistic regression, has the potential to achieve smaller mean squared errors (MSE) of coefficients and predictions than maximum likelihood estimation. There is evidence, however, that ridge logistic regression can result in highly variable calibration slopes in small or sparse data situations.
    In this paper, we elaborate on this issue by performing a comprehensive simulation study, investigating the performance of ridge logistic regression in terms of coefficients and predictions and comparing it to Firth's correction, which has been shown to perform well in low-dimensional settings. In addition to tuned ridge regression, where the penalty strength is estimated from the data by minimizing some measure of the out-of-sample prediction error or an information criterion, we also considered ridge regression with a pre-specified degree of shrinkage. We included 'oracle' models in the simulation study, in which the complexity parameter was chosen based on the true event probabilities (prediction oracle) or regression coefficients (explanation oracle), to demonstrate the capability of ridge regression if the truth were known.
    Performance of ridge regression strongly depends on the choice of complexity parameter. As shown in our simulation and illustrated by a data example, values optimized in small or sparse datasets are negatively correlated with optimal values and suffer from substantial variability which translates into large MSE of coefficients and large variability of calibration slopes. In contrast, in our simulations pre-specifying the degree of shrinkage prior to fitting led to accurate coefficients and predictions even in non-ideal settings such as encountered in the context of rare outcomes or sparse predictors.
    Applying tuned ridge regression in small or sparse datasets is problematic as it results in unstable coefficients and predictions. In contrast, determining the degree of shrinkage according to some meaningful prior assumptions about true effects has the potential to reduce bias and stabilize the estimates.
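    For orientation, the two penalized log-likelihoods compared in this abstract can be written side by side. This is a standard textbook form, not notation taken from the paper: the factor 1/2 on the ridge penalty and the symbols are assumptions, with \ell(\beta) the ordinary log-likelihood and I(\beta) = X^\top W X the Fisher information.

```latex
% Ridge: the complexity parameter \lambda \ge 0 is either tuned from the data
% or fixed in advance (the "pre-specified degree of shrinkage")
\ell_{\mathrm{ridge}}(\beta) = \ell(\beta) - \frac{\lambda}{2} \sum_{j=1}^{p} \beta_j^{2}

% Firth's correction: Jeffreys-prior penalty, with no tuning parameter
\ell_{\mathrm{Firth}}(\beta) = \ell(\beta) + \frac{1}{2} \log \det I(\beta)
```

    Read this way, the instability reported above enters through the estimate of \lambda rather than through the ridge penalty itself.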

  • Article type: Journal Article
    Ramp metering relieves traffic congestion, reduces delay, and maintains the capacity flow on freeways. Due to its operational mechanism, ramp metering can also improve freeway safety. While the operational benefits of ramp metering have extensively been quantified, research on its safety effects is sparse. This study focused on evaluating the effects of ramp metering on the safety performance of the freeway mainline. It developed a crash risk prediction model for segments downstream of the entrance ramps when ramp metering is activated. The study was based on a corridor with system-wide ramp metering along I-95 in Miami, Florida. Real-time traffic, crash, and ramp metering operations data collected from 2016 to 2018 were used in the analysis. The study adopted a matched crash and non-crash case approach to evaluate the crash risk when ramp meters were activated and deactivated. A penalized logistic regression model was developed using a bootstrap resampling technique to estimate the effects of ramp metering activation and select important variables that could predict crash risk when ramp meters were activated. Results indicated that ramp metering improves safety along the freeway corridor by reducing the crash risk downstream of the entrance ramps. During ramp metering activation, the crash risk on segments downstream of the entrance ramps 5 min later can be predicted using the difference in the average lane speeds between upstream and downstream detectors, the average traffic volume in the lanes at the downstream and upstream detectors, and the coefficient of variation of speed between lanes in the upstream detectors. Also, the coefficient of variation of occupancy downstream could predict the crash risk 15 min later. The study results could be used by transportation agencies when evaluating the deployment of ramp meters. 
    Moreover, the developed crash risk prediction model could be used in real time to help agencies identify increased crash risk and provide appropriate warning information to upstream traffic.

  • Article type: Journal Article
    Adolescent onset of depression is associated with long-lasting negative consequences. Identifying adolescents at risk for developing depression would enable the monitoring of risk factors and the development of early intervention strategies. Using machine learning to combine several risk factors from multiple modalities might allow prediction of depression onset at the individual level.
    A subsample of a multisite longitudinal study in adolescents, the IMAGEN study, was used to predict future (subthreshold) major depressive disorder onset in healthy adolescents. Based on 2-year and 5-year follow-up data, participants were grouped into the following: 1) those developing a diagnosis of major depressive disorder or subthreshold major depressive disorder and 2) healthy control subjects. Baseline measurements of 145 variables from different modalities (clinical, cognitive, environmental, and structural magnetic resonance imaging) at age 14 years were used as input to penalized logistic regression (with different levels of penalization) to predict depression onset in a training dataset (n = 407). The features contributing most to the prediction were validated in an independent hold-out sample (three independent IMAGEN sites; n = 137).
    The area under the receiver operating characteristic curve for predicting depression onset ranged between 0.70 and 0.72 in the training dataset. Baseline severity of depressive symptoms, female sex, neuroticism, stressful life events, and surface area of the supramarginal gyrus contributed most to the predictive model and predicted onset of depression, with an area under the receiver operating characteristic curve between 0.68 and 0.72 in the independent validation sample.
    This study showed that depression onset in adolescents can be predicted based on a combination of multimodal data on clinical characteristics, life events, personality traits, and brain structure variables.

  • Article type: Journal Article
    Analysis approaches for single compositional data are well established; however, effective analysis strategies for paired compositional data remain to be investigated. The current project was motivated by studies of age-related hearing loss (presbyacusis), where subjects are classified into four audiometric phenotypes and need to be ranked within these phenotypes based on their paired compositional data. We address this challenge by formulating it as a classification problem and integrating a penalized multinomial logistic regression model with compositional data analysis approaches. We use the Elastic Net as the penalty function, while considering average, absolute difference, and perturbation operators for compositional data. We applied the proposed approach to the presbyacusis study of 532 subjects with probabilities that each ear of a subject belongs to each of four presbyacusis subtypes. We further investigated the ranking of presbyacusis subjects using the proposed approach based on previous literature. The data analysis results indicate that the proposed approach is effective for ranking subjects based on paired compositional data.
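    As a reading aid, the ingredients named in this abstract can be written out in a common textbook form. The notation is an assumption (a glmnet-style parameterization with mixing weight \alpha, and Aitchison's perturbation with closure \mathcal{C}); the abstract does not state the exact objective the authors used, nor how the two ears' compositions are combined before modeling.

```latex
% Multinomial logistic model with K = 4 classes
P(y = k \mid x) = \frac{\exp(\beta_{0k} + x^\top \beta_k)}
                       {\sum_{l=1}^{K} \exp(\beta_{0l} + x^\top \beta_l)}

% Elastic Net penalised objective, 0 \le \alpha \le 1
\min_{\beta}\; -\frac{1}{n} \sum_{i=1}^{n} \log P(y_i \mid x_i)
  + \lambda \sum_{k=1}^{K} \Big[ \alpha \lVert \beta_k \rVert_1
  + \frac{1 - \alpha}{2} \lVert \beta_k \rVert_2^{2} \Big]

% Perturbation of two D-part compositions u and v (e.g. left and right ear)
u \oplus v = \mathcal{C}(u_1 v_1, \dots, u_D v_D), \qquad
\mathcal{C}(z) = \frac{z}{\sum_{j=1}^{D} z_j}
```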

  • Article type: Comparative Study
    This study evaluated the prediction performance of three different machine learning (ML) techniques in predicting opioid misuse among U.S. adolescents. Data were drawn from the 2015-2017 National Survey on Drug Use and Health (N = 41,579 adolescents, ages 12-17 years) and analyzed in 2019. Prediction models were developed using three ML algorithms: artificial neural networks, distributed random forest, and gradient boosting machine. The performance of the ML prediction models was compared with that of penalized logistic regression. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used as metrics of prediction performance. We used the AUPRC as the primary measure of prediction performance, given that it is considered more informative than the AUROC for assessing binary classifiers on an imbalanced outcome variable. The overall rate of opioid misuse among U.S. adolescents was 3.7% (n = 1521). Prediction performance was similar across the four models (AUROC values ranged from 0.809 to 0.815). In terms of the AUPRC, the distributed random forest showed the best performance (0.172), followed by penalized logistic regression (0.162), gradient boosting machine (0.160), and artificial neural networks (0.157). Findings suggest that machine learning techniques can be promising, especially for predicting rare outcomes (i.e., when the binary outcome variable is heavily lopsided) such as adolescent opioid misuse.
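    The contrast between AUROC and AUPRC on a rare outcome can be made concrete with a small sketch. Everything here is illustrative rather than taken from the study: the scores are simulated, the function names are hypothetical, and AUPRC is approximated by average precision (a step-wise area that ignores tied scores). A useful reference point is that a classifier with uninformative scores has an expected AUPRC near the prevalence, so at a 3.7% event rate the baseline is roughly 0.037 rather than 0.5.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive,
    scanning from the highest score down (ties are not specially handled)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            total += tp / rank
    return total / n_pos

# Simulated rare outcome (about 3.7% positives) with moderately informative
# scores: positives are shifted up by one standard deviation.
random.seed(1)
labels = [1 if random.random() < 0.037 else 0 for _ in range(5000)]
scores = [random.gauss(1.0 if l else 0.0, 1.0) for l in labels]

prevalence = sum(labels) / len(labels)
auc = auroc(scores, labels)
ap = average_precision(scores, labels)
print(f"prevalence={prevalence:.3f}  AUROC={auc:.3f}  AUPRC~{ap:.3f}")
```

    The same scores look respectable by AUROC and modest by AUPRC, which is why the two metrics are best read against different baselines.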

  • Article type: Evaluation Study
    A high-dimensional quantitative structure-activity relationship (QSAR) classification model typically contains a large number of irrelevant and redundant descriptors. In this paper, a new descriptor selection design for QSAR classification model estimation is proposed by adding a new weight inside the L1-norm. The experimental results of classifying the anti-hepatitis C virus activity of thiourea derivatives demonstrate that the proposed descriptor selection method performs effectively and competitively compared with other existing penalized methods in terms of classification performance on both the training and the testing datasets. Moreover, it is noteworthy that the results of the stability test and the applicability domain analysis indicate a robust QSAR classification model. It is evident from the results that the developed QSAR classification model could conceivably be employed for further high-dimensional QSAR classification studies.
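    In generic form, placing a weight inside the L1-norm gives a weighted lasso objective like the one below. The weights w_j are left abstract because the abstract does not say how the authors define them; the adaptive-lasso choice shown in the comment is a well-known example included only for illustration, not a description of the proposed method.

```latex
% Weighted L1-penalised logistic regression; \ell(\beta) is the log-likelihood
\hat{\beta} = \arg\min_{\beta} \Big\{ -\ell(\beta)
  + \lambda \sum_{j=1}^{p} w_j \lvert \beta_j \rvert \Big\}, \qquad w_j \ge 0

% Example only (adaptive lasso): w_j = 1 / \lvert \tilde{\beta}_j \rvert^{\gamma},
% with \tilde{\beta} an initial estimate and \gamma > 0
```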

  • Article type: Journal Article
    Discovering important genes that account for the phenotype of interest has long been a challenge in genome-wide expression analysis. Analyses such as gene set enrichment analysis (GSEA) that incorporate pathway information have become widespread in hypothesis testing, but pathway-based approaches have been largely absent from regression methods due to the challenges of dealing with overlapping pathways and the resulting lack of available software. The R package grpreg is widely used to fit group lasso and other group-penalized regression models; in this study, we develop an extension, grpregOverlap, to allow for overlapping group structure using a latent variable approach. We compare this approach to the ordinary lasso and to GSEA using both simulated and real data. We find that incorporation of prior pathway information can substantially improve the accuracy of gene expression classifiers, and we shed light on several ways in which hypothesis-testing approaches such as GSEA differ from regression approaches with respect to the analysis of pathway data.
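    The latent variable idea for overlapping groups can be summarized in the usual textbook formulation, sketched below with the group lasso penalty. The notation and the \sqrt{|g|} group weights are assumptions; the abstract does not give the exact objective implemented in grpregOverlap, and grpreg also supports other group penalties.

```latex
% Each pathway (group) g gets its own latent coefficient vector \gamma^{(g)},
% nonzero only on the genes in g; a gene in several pathways receives the sum
\beta = \sum_{g \in \mathcal{G}} \gamma^{(g)}, \qquad
\operatorname{supp}\big(\gamma^{(g)}\big) \subseteq g

% Group lasso applied to the latent vectors, where groups no longer overlap
\min_{\gamma}\; -\ell\Big( \sum_{g \in \mathcal{G}} \gamma^{(g)} \Big)
  + \lambda \sum_{g \in \mathcal{G}} \sqrt{\lvert g \rvert}\,
    \big\lVert \gamma^{(g)} \big\rVert_2
```

    Duplicating the columns of shared genes in this way turns the overlapping problem into an ordinary non-overlapping group lasso in the expanded space.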