dimension reduction

  • Article type: Journal Article
    Predictive patient stratification is attracting growing interest because it allows us to identify prospectively which patients will benefit from which interventions before their condition worsens. In biomedical research, a number of stratification methods have been applied successfully and have assisted the treatment process. Because medical data are heterogeneous and complex, integrating them and using them in clinical practice is very challenging. Data integration poses two major challenges. First, biomedical data are high-dimensional, so combining multiple datasets produces a vast dimensional space in which computation is complex and time-consuming. Second, the disparity between data types causes another critical problem for machine learning on biomedical data. An efficient machine learning framework is needed to address both challenges.
    In this paper, we propose a fast multiple kernel learning framework, referred to as fMKL-DR, that optimises the equations used to compute matrix chain multiplication and reduces the dimensionality of the data space. We applied our framework to two case studies, Alzheimer's disease (AD) patient stratification and cancer patient stratification, and performed several comparative evaluations on various biomedical datasets.
    In the AD case study, we significantly enhanced the multiple-ROIs approach based on MRI image data. The method successfully classified not only AD patients versus non-AD patients but also different phases of AD, with an AUC close to 1. In the cancer case study, the framework was applied to six types of cancer: glioblastoma multiforme, ovarian cancer, lung cancer, breast cancer, kidney cancer, and liver cancer. We efficiently integrated gene expression, miRNA expression, and DNA methylation data. The results showed that the classification model based on the integrated datasets was much more accurate than models based on a single data type.
    The results demonstrated that fMKL-DR markedly improves computational cost and accuracy for both AD and cancer patient stratification. We optimised the data integration, dimension reduction, and kernel fusion steps. Our framework has great potential for mining large-scale cohort data and aiding personalised prevention.
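
The kernel fusion and dimension reduction steps named in this abstract can be illustrated with a minimal sketch. It is not the fMKL-DR algorithm, whose equations the abstract does not give. It assumes Gaussian (RBF) kernels, fixed equal fusion weights, and a kernel PCA-style embedding; the arrays `expr` and `meth` are synthetic stand-ins for two data types, and all function names are hypothetical. The last lines show why the grouping of a matrix chain product matters: both groupings give the same vector, but multiplying by the vector first avoids forming an n-by-n matrix product.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fuse_kernels(kernels, weights):
    """Weighted sum of kernel matrices; weights are normalised to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))

def kernel_embedding(K, n_components):
    """Kernel PCA-style embedding from the top eigenpairs of the centred kernel."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    vals, vecs = np.linalg.eigh(H @ K @ H)       # eigenvalues in ascending order
    vals = np.maximum(vals[::-1][:n_components], 0.0)
    vecs = vecs[:, ::-1][:, :n_components]
    return vecs * np.sqrt(vals)

# Synthetic stand-ins for two data types measured on the same n patients (assumption).
rng = np.random.default_rng(0)
n = 40
expr = rng.normal(size=(n, 500))   # hypothetical gene expression matrix
meth = rng.normal(size=(n, 800))   # hypothetical DNA methylation matrix

K_expr = rbf_kernel(expr, gamma=1.0 / expr.shape[1])
K_meth = rbf_kernel(meth, gamma=1.0 / meth.shape[1])
K_fused = fuse_kernels([K_expr, K_meth], weights=[0.5, 0.5])
Z = kernel_embedding(K_fused, n_components=5)   # n x 5 low-dimensional representation

# Matrix chain ordering: same result, different cost.
v = rng.normal(size=n)
slow = (K_expr @ K_meth) @ v   # forms an n x n product first: O(n^3)
fast = K_expr @ (K_meth @ v)   # two matrix-vector products: O(n^2)
```

In a multiple kernel learning setting the fusion weights would be learned rather than fixed; equal weights are used here only to keep the sketch short. For longer chains, `np.linalg.multi_dot` chooses a multiplication order automatically.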

  • Article type: Journal Article
    BACKGROUND: In human genetic association studies with high-dimensional gene expression data, it is well known that statistical selection methods that use prior biological network knowledge, such as genetic pathways and signaling pathways, can outperform methods that ignore genetic network structure in terms of true positive selection. In recent epigenetic research on case-control association studies, a relatively large number of statistical methods have been proposed to identify cancer-related CpG sites and their corresponding genes from high-dimensional DNA methylation array data. However, most existing methods are not designed to use genetic network information, even though the methylation levels of genes linked in a genetic network tend to be highly correlated with each other.
    RESULTS: We propose a new approach that combines data dimension reduction techniques with network-based regularization to identify outcome-related genes in high-dimensional DNA methylation data. In simulation studies, we demonstrated that the proposed approach outperforms other statistical methods that do not use genetic network information in terms of true positive selection. We also applied it to the 450K DNA methylation array data of the four breast invasive carcinoma subtypes from The Cancer Genome Atlas (TCGA) project.
    CONCLUSIONS: The proposed variable selection approach can use prior biological network information in the analysis of high-dimensional DNA methylation array data. It first captures gene-level signals from multiple CpG sites using a data dimension reduction technique and then performs network-based regularization based on biological network graph information. It can select potentially cancer-related genes and genetic pathways that were missed by existing methods.
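
The two-stage idea described in this abstract (gene-level signals from several CpG sites, then a network-based penalty) can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the abstract does not name the dimension reduction technique or the exact penalty, so the sketch assumes the first principal component as the gene-level summary and a graph Laplacian quadratic penalty with a ridge term. It also uses a continuous outcome so that the estimate has a closed form. The gene names, the CpG-to-gene mapping, the adjacency matrix, and all function names are hypothetical.

```python
import numpy as np

def first_pc_scores(X):
    """Sample scores on the first principal component of X (columns = CpG sites)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * s[0]

def graph_laplacian(adj):
    """Unnormalised Laplacian L = D - A of an undirected gene network."""
    return np.diag(adj.sum(axis=1)) - adj

def network_ridge(G, y, L, lam_ridge, lam_net):
    """Minimise ||y - G b||^2 + lam_ridge * ||b||^2 + lam_net * b' L b in closed form."""
    p = G.shape[1]
    A = G.T @ G + lam_ridge * np.eye(p) + lam_net * L
    return np.linalg.solve(A, G.T @ y)

rng = np.random.default_rng(0)
n = 60
meth = rng.normal(size=(n, 12))    # synthetic methylation levels, 12 CpG sites

# Hypothetical mapping from gene to the columns of `meth` holding its CpG sites.
cpg_sites = {"geneA": [0, 1, 2, 3], "geneB": [4, 5, 6], "geneC": [7, 8, 9, 10, 11]}

# Stage 1: one gene-level signal per gene.
G = np.column_stack([first_pc_scores(meth[:, cols]) for cols in cpg_sites.values()])

# Stage 2: penalise differences between coefficients of linked genes.
adj = np.array([[0.0, 1.0, 0.0],   # geneA and geneB are linked (assumption)
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])  # geneC is isolated
L = graph_laplacian(adj)
y = G[:, 0] + G[:, 1] + rng.normal(scale=0.5, size=n)   # synthetic continuous outcome
beta = network_ridge(G, y, L, lam_ridge=1.0, lam_net=5.0)
```

Two simplifications are worth noting. The quadratic penalty only smooths coefficients across linked genes and does not select variables, so a selection method would add a sparsity-inducing term. The sign of a principal component is arbitrary, so a real analysis would need to align signs before comparing coefficients of neighbouring genes.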

  • Article type: Journal Article
    Studying the relationship between covariates based on retrospective data is the main purpose of secondary analysis, an area of increasing interest. We examine the secondary analysis problem when multiple covariates are available and only a regression mean model is specified. Although the regression mean function is modeled fully parametrically, the case-control nature of the data requires special treatment, and semiparametric efficient estimation gives rise to various nonparametric estimation problems with multivariate covariates. We devise a dimension reduction approach that fits the primary and secondary models specified in the original problem setting, and we use reweighting to adjust for the case-control nature of the data, even when the disease rate in the source population is unknown. The resulting estimator is both locally efficient and robust against misspecification of the regression error distribution, which can be heteroscedastic as well as non-Gaussian. We demonstrate the advantage of our method over several existing methods, both analytically and numerically.

  • Article type: Journal Article
    BACKGROUND: Speech disorders such as dysphonia and dysarthria are an early and common manifestation of Parkinson's disease. Class prediction is an essential task in automatic speech treatment, particularly for Parkinson's disease. Many classification experiments have focused on automatically distinguishing Parkinson's disease patients from healthy speakers, but the results are still unsatisfactory. A major obstacle is the high dimensionality of speech data.
    OBJECTIVE: This work considers the potential of Principal Component Analysis (PCA)-based modeling for dimensionality reduction as a data smoothing tool with multiclass target expression data.
    METHODS: Building on the proposed PCA-based modeling, we investigate the class-prediction power of logistic regression (LR) and C5.0 on numeric data from the publicly available Parkinson's disease Lee Silverman Voice Treatment (LSVT) dataset in order to develop an advanced classification model.
    RESULTS: The main advantage of our model is the effective reduction of the number of factors from p = 309 to k = 32 for the LSVT Voice Rehabilitation dataset, with classification accuracies of 100% and 99.92% for PCA-LR and PCA-C5.0, respectively. In addition, using only 9 dysphonia features, classification accuracy was 99.20% and 99.11% for PCA-LR and PCA-C5.0, respectively.
    CONCLUSIONS: Our combined dimension reduction and data smoothing approach has significant potential to minimize the number of features, increase classification accuracy, and automatically classify subjects as Parkinson's disease patients or healthy speakers.
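
The PCA-then-logistic-regression pipeline described in this abstract can be sketched in a few lines. This is a generic illustration, not the authors' model: the data are synthetic, the sample sizes and train/test split are assumptions, and only p = 309 and k = 32 are taken from the abstract. PCA is fitted on the training rows only and then applied to the held-out rows, and logistic regression is fitted by plain gradient descent to keep the example dependent on NumPy alone. All function names are hypothetical.

```python
import numpy as np

def fit_pca(X, k):
    """Return the training mean, the top-k principal axes, and the score scales."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    scale = s[:k] / np.sqrt(X.shape[0] - 1)   # standard deviation of each score
    return mean, Vt[:k], scale

def transform_pca(X, mean, components, scale):
    """Project X onto the principal axes and rescale scores to unit variance."""
    return (X - mean) @ components.T / scale

def fit_logistic(Z, y, lr=0.1, n_iter=500):
    """Logistic regression with an intercept, fitted by gradient descent."""
    Zb = np.hstack([np.ones((Z.shape[0], 1)), Z])
    w = np.zeros(Zb.shape[1])
    for _ in range(n_iter):
        logits = np.clip(Zb @ w, -30.0, 30.0)   # avoid overflow in exp
        prob = 1.0 / (1.0 + np.exp(-logits))
        w -= lr * Zb.T @ (prob - y) / len(y)
    return w

def predict_logistic(Z, w):
    """Predict class 1 when the linear score is positive."""
    Zb = np.hstack([np.ones((Z.shape[0], 1)), Z])
    return (Zb @ w > 0).astype(int)

# Synthetic two-class data with p = 309 features (not the LSVT dataset).
rng = np.random.default_rng(0)
n, p, k = 120, 309, 32
y = rng.integers(0, 2, size=n)
latent = rng.normal(size=(n, 5))
latent[:, 0] += 2.0 * y                       # class signal in one latent factor
X = latent @ rng.normal(size=(5, p)) + 0.5 * rng.normal(size=(n, p))

X_train, y_train = X[:90], y[:90]
X_test, y_test = X[90:], y[90:]

mean, components, scale = fit_pca(X_train, k)          # fit on training rows only
Z_train = transform_pca(X_train, mean, components, scale)
Z_test = transform_pca(X_test, mean, components, scale)

w = fit_logistic(Z_train, y_train)
accuracy = float(np.mean(predict_logistic(Z_test, w) == y_test))
```

Fitting PCA inside the training split matters when reporting accuracy: if the components are computed on all rows, information from the held-out rows leaks into the features.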