CatBoost
  • Article type: Journal Article
    Recent technological advances in proteomics make it possible to quantify plasma proteomes in large patient cohorts, to screen for biomarkers, and to guide the early diagnosis and treatment of depression. Here we used CatBoost machine learning to model and discover biomarkers of depression in UK Biobank data sets (depression n = 4,479; healthy controls n = 19,821). CatBoost was used for model construction, and Shapley Additive Explanations (SHAP) were used to interpret the resulting models. Model performance was assessed by 5-fold cross-validation, and diagnostic efficacy was evaluated by the area under the receiver operating characteristic curve (AUC). A total of 45 depression-related proteins were screened based on the top 20 important features output by the CatBoost model in six data sets. Among the nine diagnostic models for depression, adding proteomic data improved the performance of the traditional risk factor model, and the best model reached an average AUC of 0.764 in the test sets. KEGG pathway analysis of the 45 screened proteins showed that the most significant pathway involved was cytokine-cytokine receptor interaction. Exploring diagnostic biomarkers of depression with data-driven machine learning methods and large-scale data sets is feasible, although the results require validation.
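
    The abstract above relies on SHAP to attribute a model's output to individual features. As a minimal sketch of the underlying idea (not the `shap` package API, and not the study's model), the following computes exact Shapley values for one sample by enumerating feature subsets. The scoring function `toy_model`, the sample `x`, and the all-zero `baseline` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating feature subsets.

    Features outside a subset are replaced by their baseline value.
    """
    n = len(x)
    features = range(n)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in features]
        return predict(mixed)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            # Classic Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical score with an interaction between features 0 and 1.
def toy_model(v):
    return 2.0 * v[0] + v[1] + v[0] * v[1] - 0.5 * v[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

    The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that SHAP plots depend on. Exhaustive enumeration is exponential in the number of features, so practical tools use faster tree-specific algorithms.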

  • Article type: Journal Article
    Designing a safe and rational laneway support scheme is a crucial prerequisite for safe and efficient mining. However, the conventional empirical support system for mining laneways struggles to assess the rationality of support methods, which can compromise laneway safety and reliability. To address this issue, the safety factor was incorporated into research on laneway support, and a safety evaluation method for laneway support based on the safety factor was established. Using data from a specific iron mine laneway in central China, the CRITIC method was employed to preprocess the sample data. A Bayesian algorithm was then used to optimize the hyperparameters of the CatBoost model, and a prediction model based on the resulting BO-CatBoost model was proposed for evaluating laneway safety factors under plain shotcrete support. The root mean square error (RMSE), the mean absolute error (MAE), the correlation coefficient (R2), the variance accounted for (VAF), and the a-20 index were used to examine the predictive performance of each proposed model. Compared with the other models, the BO-CatBoost model gave the best safety factor predictions in the test set, with the lowest RMSE (0.5688) and MAE (0.4074), the largest R2 (0.9553) and VAF (95.25%), and an appropriate a-20 index (0.9167). The BO-CatBoost model was therefore the most appropriate of the machine learning methods compared for predicting the safety factor, and it offers a novel approach for optimizing laneway support design and laneway safety evaluation.
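
    The five performance indexes named above can be sketched in a few lines. The abstract does not give their formulas, so the definitions below are assumptions based on common usage: R2 is taken as the coefficient of determination, VAF as (1 - Var(error) / Var(y)) x 100, and the a-20 index as the share of samples whose predicted-to-measured ratio lies within 0.8 to 1.2. The sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def regression_metrics(y_true, y_pred):
    """RMSE, MAE, R2, VAF (%) and a-20 index under the assumed definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    y_bar = mean(y_true)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    within_20 = sum(1 for t, p in zip(y_true, y_pred) if 0.8 <= p / t <= 1.2)
    return {
        "RMSE": sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
        "VAF": (1.0 - pvariance(errors) / pvariance(y_true)) * 100.0,
        "a20": within_20 / n,
    }


# Hypothetical measured and predicted safety factors.
y_true = [1.0, 2.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.0, 5.0]
m = regression_metrics(y_true, y_pred)
print(m)
```

    Lower RMSE and MAE and higher R2, VAF and a-20 indicate a better fit, which matches how the abstract ranks the models.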

  • Article type: Journal Article
    Protein-DNA complex interactions play a crucial role in biological activities such as gene expression, modification, replication and transcription. Understanding the physiological significance of hot spots at protein-DNA binding interfaces, and advancing computational biology, depends on the precise identification of these regions. In this paper, a hot spot prediction method called EC-PDH is proposed. First, we extracted the solid solvent-accessible surface area (ASA) and secondary structure features of these hot spots; then the mean, variance, energy and autocorrelation function values of the first three intrinsic mode functions (IMFs) of these conventional features were extracted as new features via empirical mode decomposition (EMD). A total of 218 feature dimensions were obtained. For feature selection, we used the maximum correlation minimum redundancy sequential forward selection method (mRMR-SFS) to obtain an optimal 11-dimensional feature subset. To address data imbalance, we used the SMOTE-Tomek algorithm to balance positive and negative samples, and finally used categorical boosting (CatBoost) to construct our hot spot prediction model for protein-DNA binding interfaces. Our method performs well on the test set, with AUC, MCC and F1 score values of 0.847, 0.543 and 0.772, respectively. In a comparative evaluation, EC-PDH outperforms existing state-of-the-art methods in identifying hot spots.
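
    MCC and F1, two of the scores reported above, are both functions of the confusion matrix and are commonly preferred over accuracy on imbalanced data. A minimal sketch using the standard definitions follows; the confusion counts are hypothetical and are not taken from the study.

```python
from math import sqrt


def mcc_and_f1(tp, fp, tn, fn):
    """Matthews correlation coefficient and F1 score from confusion counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return mcc, f1


# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
mcc, f1 = mcc_and_f1(tp=40, fp=10, tn=45, fn=5)
print(round(mcc, 4), round(f1, 4))
```

    MCC uses all four cells of the confusion matrix, so it stays informative when one class dominates, whereas F1 ignores true negatives.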

  • Article type: Journal Article
    Regulatory agencies generate a vast amount of textual data during the review process. For example, drug labeling is a valuable resource that regulatory agencies, such as the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA), use to communicate drug safety and effectiveness information to healthcare professionals and patients. Drug labeling also serves as a resource for pharmacovigilance and drug safety research. Automated text classification would significantly improve the analysis of drug labeling documents and conserve reviewer resources.
    In this study we used artificial intelligence to classify drug-induced liver injury (DILI)-related content in drug labeling documents, based on FDA's DILIrank dataset. We employed text mining and XGBoost models, and used the Preferred Terms of Medical queries for adverse event standards to simplify the elimination of common words and phrases while retaining standard medical terms in the FDA and EMA drug label datasets. We then constructed a document-term matrix using weights computed by Term Frequency-Inverse Document Frequency (TF-IDF) for each included word, term, or token.
    The automatic text classification model performed robustly in predicting DILI, achieving cross-validation AUC scores above 0.90 both for drug labels from FDA and EMA and for literature abstracts from the Critical Assessment of Massive Data Analysis (CAMDA).
    Moreover, the text mining and XGBoost functions demonstrated in this study can be applied to other text processing and classification tasks.
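
    The TF-IDF weighted document-term matrix described above can be illustrated with a small pure-Python sketch. It assumes whitespace tokenization, a hypothetical stop-word list, and the plain unsmoothed weight tf x log(N / df); library implementations often add smoothing and normalization, so their numbers would differ. The three label snippets are invented for illustration.

```python
from collections import Counter
from math import log


def tfidf_matrix(docs, stop_words=frozenset()):
    """Return (vocabulary, rows) where rows[d][j] is the TF-IDF weight."""
    tokenized = [
        [w for w in doc.lower().split() if w not in stop_words] for doc in docs
    ]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: log(n_docs / df[w]) for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1
        rows.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, rows


# Hypothetical drug-label snippets and stop words.
docs = [
    "hepatotoxicity reported in patients",
    "no liver injury reported",
    "liver injury and hepatotoxicity reported in patients",
]
vocab, matrix = tfidf_matrix(docs, stop_words=frozenset({"in", "and", "no"}))
print(vocab)
```

    A term that occurs in every document, such as "reported" here, receives zero weight. Each row of the matrix could then serve as the feature vector passed to a gradient boosting classifier, which is not shown.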

  • Article type: Journal Article
    Carbon price is a pivotal element of the carbon trading sector, and accurate estimates of it can guide carbon market participants. This study introduces a novel prediction model that provides both point and interval predictions of the carbon price. First, to capture the volatility inherent in the carbon price, successive variational mode decomposition is used to adaptively decompose the price into regular sequences. Second, to obtain the optimal input variables, the partial autocorrelation function and random forest are used to filter the influencing factors and historical carbon prices. Then, to avoid the constraints of a single model, a combination of categorical boosting and a kernel extreme learning machine, optimized by the sparrow search algorithm, is used for point prediction, and Shapley additive explanations are used to elucidate the prediction process. Finally, to provide more informative output, adaptive bandwidth kernel density estimation is applied to interval prediction. Data from the Hubei carbon market are used as a case study, and the results indicate that the mean absolute error, mean absolute percentage error, root mean square error and R2 of the proposed model are 0.1022, 0.0022, 0.1262 and 0.9921, respectively. The historical carbon price, the Brent crude oil futures settlement price and the European Union allowance futures price have a positive impact on the carbon price, whereas the CSI 300 index (Hushen 300) has a negative impact. Compared with constant-bandwidth kernel density estimation, the proposed model achieves a higher interval coverage probability and a lower interval width. The hybrid model can thus promote the operational efficiency of the carbon market and facilitate the implementation of carbon emission reduction policies.
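
    Interval coverage probability and interval width, the two criteria used above to judge interval predictions, can be computed as follows. This is a sketch under the usual definitions (share of observations that fall inside their interval, and mean upper-minus-lower width); the study may normalize the width differently. The prices and bounds are hypothetical.

```python
def interval_metrics(y_true, lower, upper):
    """Coverage probability and mean width of prediction intervals."""
    n = len(y_true)
    covered = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return covered / n, mean_width


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical carbon prices with point and interval forecasts.
y_true = [40.0, 42.0, 41.0, 45.0]
y_pred = [44.0, 42.0, 41.0, 45.0]
lower = [39.0, 41.0, 41.5, 43.0]
upper = [41.0, 43.0, 42.5, 46.0]

picp, width = interval_metrics(y_true, lower, upper)
print(picp, width, mape(y_true, y_pred))
```

    The two interval criteria pull in opposite directions: widening every interval raises coverage, so a method is only better if it covers more observations without becoming wider.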

  • Article type: Journal Article
    BACKGROUND: This study aims to implement a validated prediction model and application medium for postoperative pneumonia (POP) in elderly patients with hip fractures in order to facilitate individualized intervention by clinicians.
    METHODS: Using clinical data from elderly patients with hip fractures, we derived and externally validated machine learning models for predicting POP. Model derivation used a registry from Nanjing First Hospital, and external validation was performed using data from patients at the Fourth Affiliated Hospital of Nanjing Medical University. The derivation cohort was divided into a training set and a testing set. The least absolute shrinkage and selection operator (LASSO) and multivariable logistic regression were used for feature screening. We compared model performance to select the optimal model and used SHapley Additive exPlanations (SHAP) to interpret it.
    RESULTS: The derivation and validation cohorts comprised 498 and 124 patients, with POP rates of 14.3% and 10.5%, respectively. Among the models, categorical boosting (CatBoost) demonstrated the best discrimination. The AUROC was 0.895 (95% CI: 0.841-0.949) on the training set and 0.835 (95% CI: 0.740-0.930) on the testing set. At external validation, the AUROC was 0.894 (95% CI: 0.821-0.966). The SHAP method showed that CRP, the modified five-item frailty index (mFI-5), and ASA physical status were the three most important predictors of POP.
    CONCLUSIONS: Our model's good early prediction ability, combined with a web-based risk calculator built on the CatBoost model, is expected to distinguish high-risk POP groups effectively and to facilitate timely intervention.
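
    AUROC values with 95% confidence intervals, as reported above, can be obtained in several ways, and the abstract does not say which was used. The sketch below assumes a rank-based AUROC with a percentile bootstrap interval, which is one common choice. The labels and risk scores are hypothetical.

```python
import random


def auc(labels, scores):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample must contain both classes
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical outcomes (1 = pneumonia) and predicted risks.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
point = auc(labels, scores)
lo, hi = bootstrap_ci(labels, scores)
print(point, lo, hi)
```

    With only 124 patients in the external cohort, an interval of this kind is wide, which is consistent with the ranges quoted in the results.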

  • Article type: Journal Article
    To predict the anti-trypanosome effect of carbazole-derived compounds by quantitative structure-activity relationship, five models were established: the linear method, random forest, radial basis kernel function support vector machine, linear combination mix-kernel function support vector machine, and nonlinear combination mix-kernel function support vector machine (NLMIX-SVM). The heuristic method and optimized CatBoost were used to select two different key descriptor sets for building the linear and nonlinear models, respectively. Hyperparameters in all nonlinear models were optimized by comprehensive learning particle swarm optimization, which has low complexity and fast convergence. The models' robustness and reliability were rigorously assessed using fivefold and leave-one-out cross-validation, y-randomization, and statistics including the concordance correlation coefficient (CCC), [Formula: see text], [Formula: see text], and [Formula: see text]. Among all the models, the NLMIX-SVM model, which was established by support vector regression using a nonlinear combination of the radial basis, sigmoid, and linear kernel functions as a new kernel function, demonstrated excellent learning and generalization abilities as well as robustness: [Formula: see text] = 0.9581 and mean square error (MSE) = 0.0199 for the training set, and [Formula: see text] = 0.9528 and MSE = 0.0174 for the test set. [Formula: see text], [Formula: see text], CCC, [Formula: see text], [Formula: see text], and [Formula: see text] are 0.9539, 0.8908, 0.9752, 0.9529, 0.9528, and 0.9633, respectively. The NLMIX-SVM method proved to be a promising approach in quantitative structure-activity relationship research. In addition, molecular docking experiments were conducted to analyze the properties of new derivatives, and a new potential candidate drug molecule was ultimately found. In summary, this study will help in the design and screening of novel anti-trypanosome drugs.
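
    The concordance correlation coefficient (CCC) mentioned above measures agreement with the identity line rather than mere linear association. A minimal sketch, assuming Lin's definition with population (divide-by-n) moments, is given below; the activity values are hypothetical.

```python
from statistics import mean


def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(y_true)
    mx, my = mean(y_true), mean(y_pred)
    vx = sum((x - mx) ** 2 for x in y_true) / n
    vy = sum((y - my) ** 2 for y in y_pred) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(y_true, y_pred)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)


# Hypothetical observed activities and two sets of predictions.
observed = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]  # perfectly correlated but biased by +1

print(concordance_cc(observed, perfect))
print(concordance_cc(observed, shifted))
```

    The shifted predictions have a Pearson correlation of 1 with the observations, yet their CCC falls below 1 because the constant bias is penalized. This is why CCC is often reported alongside squared correlation statistics when validating QSAR models.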

  • Article type: Journal Article
    OBJECTIVE: This study aimed to build an effective machine learning model to help predict stroke recurrence in adult stroke patients with moyamoya disease (MMD), and to analyze the factors associated with stroke recurrence.
    METHODS: The data for this retrospective study came from the database of the JiangXi Province Medical Big Data Engineering & Technology Research Center. Information on MMD patients admitted to the Second Affiliated Hospital of Nanchang University from January 1st, 2007 to December 31st, 2019 was acquired. A total of 661 patients admitted from January 1st, 2007 to February 28th, 2017 formed the training set, while the external validation set comprised 284 patients admitted from March 1st, 2017 to December 31st, 2019. First, the information on all subjects was compared between the training set and the external validation set. The key influencing variables were screened using the Lasso regression algorithm. Models for predicting stroke recurrence within 1, 2, and 3 years after the initial stroke were then built with five different machine learning algorithms, and all models were externally validated and compared. Lastly, the CatBoost model, which had the best performance, was explained using the SHapley Additive exPlanations (SHAP) interpretation model.
    RESULTS: In total, 945 patients with MMD were recruited, and the recurrence rate of acute stroke within 1, 2, and 3 years after the initial stroke was 11.43% (108/945), 18.94% (179/945), and 23.17% (219/945), respectively. The CatBoost models showed the best prediction performance among all models; their area under the curve (AUC) for predicting stroke recurrence within 1, 2, and 3 years was 0.794 (0.787, 0.801), 0.813 (0.807, 0.818), and 0.789 (0.783, 0.795), respectively. According to the SHAP interpretation model, high Suzuki stage, young adulthood (aged 18-44), no surgical treatment, and the presence of an aneurysm were likely to be significantly correlated with stroke recurrence in adult stroke patients with MMD.
    CONCLUSION: In adult stroke patients with MMD, the CatBoost model was effective in predicting stroke recurrence, yielding accurate and reliable predictions. High Suzuki stage, young adulthood (aged 18-44 years), no surgical treatment, and the presence of an aneurysm are likely to be significantly correlated with stroke recurrence in these patients.

  • Article type: Journal Article
    Diabetic retinopathy is one of the major complications of diabetes. In this study, a diabetic retinopathy risk prediction model integrating machine learning models and SHAP was established to increase the accuracy of risk prediction, explain the rationale behind the model's predictions, and improve the reliability of the prediction results.
    Data were preprocessed for missing values and outliers, features were selected by information gain, a diabetic retinopathy risk prediction model was established using CatBoost, and the outputs of the model were interpreted using SHAP.
    One thousand records from the diabetes complication early warning dataset of the National Clinical Medical Sciences Data Center were used in this study. The CatBoost-based model for diabetic retinopathy prediction performed best in the comparative model test. ALB_CR, HbA1c, UPR_24, NEPHROPATHY and SCR were positively correlated with diabetic retinopathy, while CP, HB, ALB, DBILI and CRP were negatively correlated with it. The relationships between HEIGHT, WEIGHT and ESR and diabetic retinopathy were not significant.
    The risk factors for diabetic retinopathy include poor renal function, elevated blood glucose level, liver disease, hematologic disease and abnormal blood pressure, among others. Diabetic retinopathy can be prevented by monitoring and effectively controlling the relevant indices. The influence relationships between the features were also analyzed to further explore potential factors in diabetic retinopathy, which can provide new methods and ideas for its early prevention and clinical diagnosis.
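
    Feature selection by information gain, as used above, ranks each feature by how much knowing it reduces the entropy of the outcome. The following is a minimal sketch for categorical features, assuming the standard Shannon-entropy definition; continuous clinical measurements would first need to be discretized, and the feature values and labels here are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for value, y in zip(feature, labels):
        groups.setdefault(value, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical outcome (1 = retinopathy) and two discretized features.
labels = [1, 1, 0, 0]
informative = ["high", "high", "low", "low"]
uninformative = ["a", "b", "a", "b"]

print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

    Features would be ranked by this score and the top ones kept. A feature that separates the classes perfectly recovers the full label entropy, while one that is independent of the outcome scores zero.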

  • Article type: Journal Article
    This study explores the effects of different non-landslide sampling strategies on machine learning models in landslide susceptibility mapping. Non-landslide samples are inherently uncertain, and their selection may suffer from issues such as noise or insufficient regional representation, which can affect the accuracy of the results. In this study, a positive-unlabeled (PU) bagging semi-supervised learning method was introduced for non-landslide sample selection. In addition, buffer control sampling (BCS) and K-means (KM) clustering were applied for comparative analysis. Based on landslide data from Qiaojia County, Yunnan Province, China, collected in 2014, three machine learning models, namely random forest, support vector machine, and CatBoost, were used for landslide susceptibility mapping. The results show that the quality of the samples selected by the different non-landslide sampling strategies varies significantly. Overall, the non-landslide samples selected using the PU bagging method are of superior quality, and this method performs best when combined with CatBoost for predicting (AUC = 0.897) landslides in very high and high susceptibility zones (82.14%). Additionally, the KM results indicated overfitting, with high validation accuracy but poor statistical outcomes for zoning. The BCS results were the worst.