Disease risk prediction model

  • Article type: Journal Article
    BACKGROUND: Genome-wide association studies have successfully identified genetic variants associated with human disease. Various statistical approaches based on penalized and machine learning methods have recently been proposed for disease prediction. In this study, we evaluated the performance of several such methods for predicting asthma using the Korean Chip (KORV1.1) from the Korean Genome and Epidemiology Study (KoGES).
    RESULTS: First, single-nucleotide polymorphisms were selected via single-variant tests using logistic regression adjusted for several epidemiological factors. Next, we evaluated the following methods for disease prediction: ridge, least absolute shrinkage and selection operator, elastic net, smoothly clipped absolute deviation, support vector machine, random forest, boosting, bagging, naïve Bayes, and k-nearest neighbor. Finally, we compared their predictive performance based on the area under the receiver operating characteristic curve, precision, recall, F1-score, Cohen's kappa, balanced accuracy, error rate, Matthews correlation coefficient, and area under the precision-recall curve. Additionally, three oversampling algorithms were used to deal with class imbalance.
    CONCLUSIONS: Our results show that penalized methods exhibit better predictive performance for asthma than machine learning methods. On the other hand, in the oversampling study, random forest and boosting methods overall showed better predictive performance than penalized methods.
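
    Several of the threshold-based metrics named in this abstract can be derived from a single binary confusion matrix. The following is a minimal sketch of those formulas, not the authors' evaluation code; the function name `confusion_metrics` and the example counts are hypothetical, and the sketch assumes every denominator is non-zero.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Metrics from binary confusion-matrix counts (assumes non-zero denominators)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
        "error_rate": 1 - accuracy,
        "cohen_kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced problem (60 cases, 140 controls).
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

    With imbalanced classes, the error rate alone can look favorable even when recall is poor, which is one reason to report balanced accuracy, kappa, and the Matthews correlation coefficient alongside it.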

  • Article type: Journal Article
    Systemic lupus erythematosus (SLE) is a latent, insidious autoimmune disease. Building on recent advances in gene sequencing, this study aims to develop a gene-based predictive model for identifying SLE at the genetic level. First, gene expression datasets of SLE whole blood samples were collected from the Gene Expression Omnibus (GEO) database. After the datasets were merged, they were divided into training and validation datasets in a ratio of 7:3: the training dataset contained 334 SLE samples and 71 healthy samples, and the validation dataset contained 143 SLE samples and 30 healthy samples. The training dataset was used to build the disease risk prediction model, and the validation dataset was used to verify the model's ability to identify SLE. We first analyzed differentially expressed genes (DEGs) and then used Lasso and random forest (RF) to screen out six key genes (OAS3, USP18, RTP4, SPATS2L, IFI27 and OAS1), which are essential for distinguishing SLE from healthy samples. With the six key genes incorporated into the RF model and five repetitions of 10-fold cross-validation performed, we determined the RF model with the optimal mtry. The mean area under the curve (AUC) and mean accuracy of the models were over 0.95. The validation dataset was then used to evaluate AUC performance, and our model had an AUC of 0.948. On an external validation dataset (GSE99967), the model achieved an AUC of 0.810, an accuracy of 0.836, and a sensitivity of 0.921. A second external validation dataset (GSE185047), consisting entirely of SLE patients, yielded an SLE sensitivity of up to 0.954. The final high-throughput RF model had a mean AUC over 0.9, again showing good results. In conclusion, we identified key genetic biomarkers and successfully developed a novel disease risk prediction model for SLE that can be used as a new SLE disease risk prediction aid and contribute to the identification of SLE.
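
    The sample counts and the resampling scheme in this abstract can be checked with a short sketch. It confirms that the reported counts match a 7:3 split and enumerates the 50 train/test partitions implied by five repetitions of 10-fold cross-validation on the training set. This is an illustration only: the helper `repeated_kfold` is hypothetical, it shuffles without stratifying by class, and it does not fit a random forest or tune mtry. The abstract does not say how the authors implemented these steps.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=5, seed=0):
    """Return (train_indices, test_indices) pairs for repeated k-fold CV.

    Hypothetical helper: plain shuffling, no stratification by class.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal shuffled indices into k folds of near-equal size.
        folds = [order[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            splits.append((train, test))
    return splits


# Sample counts reported in the abstract.
n_train = 334 + 71    # SLE + healthy, training dataset
n_valid = 143 + 30    # SLE + healthy, validation dataset
train_fraction = n_train / (n_train + n_valid)
print(f"training fraction: {train_fraction:.3f}")

splits = repeated_kfold(n_train, k=10, repeats=5)
print(f"number of train/test partitions: {len(splits)}")
```

    In practice a stratified splitter would usually be preferred here, since healthy samples make up less than a fifth of the training set.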

  • Article type: Journal Article
    In this paper, we propose a feature extraction method for a prediction model targeting the early stage of diabetic kidney disease (DKD) progression. DKD needs continuous treatment; however, the interval between hospital visits for a patient at the early stage of DKD is normally one to three months, which is not a short period. The resulting data limitation makes it difficult to apply sophisticated approaches such as convolutional neural networks. The proposed method uses hierarchical clustering, which can estimate a suitable interval for grouping input sequences. We evaluate the proposed method on a real EMR dataset consisting of 30,810 patient records and conclude that it outperforms the baseline methods derived from related work.
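
    The abstract does not specify the linkage, the distance measure, or how the grouping interval is estimated. As a rough illustration of the general idea only, the sketch below applies single-linkage agglomerative clustering to one patient's visit days, merging the closest adjacent groups until the smallest remaining gap exceeds a threshold. The function `group_visits`, the `max_gap_days` parameter, and the sample data are all assumptions, not the authors' method.

```python
def group_visits(visit_days, max_gap_days):
    """Single-linkage agglomerative grouping of 1-D visit times.

    Hypothetical sketch: in one dimension the two closest clusters under
    single linkage are always neighbors in sorted order, so only adjacent
    clusters need to be compared.
    """
    clusters = [[day] for day in sorted(visit_days)]
    while len(clusters) > 1:
        # Gap between each cluster's last visit and the next cluster's first.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > max_gap_days:
            break  # remaining clusters are farther apart than the threshold
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


# Hypothetical visit days (days since first visit) for one patient.
days = [0, 28, 35, 120, 125, 130, 300]
groups = group_visits(days, max_gap_days=45)
print(groups)
```

    Each resulting group could then be summarized (for example, by averaging laboratory values within it) to produce fixed-length features from irregularly spaced visits. That summarization step is likewise an assumption about how such groups might be used.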