Feature reduction

  • Article type: Journal Article
    Brain-computer interfaces (BCIs) are systems that acquire the brain's electrical activity and provide control of external devices. Since electroencephalography (EEG) is the simplest non-invasive method of capturing the brain's electrical activity, EEG-based BCIs are very popular designs. Beyond classifying limb movements, recent BCI studies have focused on accurately decoding the movements of individual fingers on the same hand using machine learning techniques. State-of-the-art studies have decoded five finger movements while neglecting the brain's idle state (i.e., the state in which the brain is not performing any mental task). This can easily cause more false positives and dramatically degrade classification performance, and thus the performance of BCIs. This study aims to propose a more realistic system that decodes the movements of five fingers and the no mental task (NoMT) case from EEG signals.
    In this study, a novel praxis for feature extraction is utilized. Features for classification are extracted from Proper Rotational Components (PRCs) computed through Intrinsic Time Scale Decomposition (ITD), which has recently been applied successfully to various biomedical signals. These features were then fed to well-known classifiers and their different implementations to discriminate between the six classes. The highest classifier performances obtained in both the subject-independent and subject-dependent cases are reported. In addition, ANOVA-based feature selection was examined to determine whether statistically significant features affect classifier performance.
    The Ensemble Learning classifier achieved the highest accuracy among the tested classifiers, 55.0%, and ANOVA-based feature selection increased classifier performance on five-finger movement determination in EEG-based BCI systems.
    Compared with similar studies, the proposed praxis achieved a modest yet significant improvement in classification performance even though the number of classes was increased by one (i.e., NoMT).
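    The ANOVA-based feature selection mentioned above can be illustrated with a minimal sketch. It computes a one-way ANOVA F statistic per feature and ranks the features by it. The abstract does not give the authors' significance threshold or implementation, so ranking by F, the helper names, and the toy data are all assumptions made for illustration.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic for a single feature across classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    k, n = len(groups), len(values)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_features(X, labels):
    """Return feature indices sorted by descending F statistic."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

# Hypothetical toy data: feature 0 separates the two classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.0], [3.1, 5.0], [2.9, 3.0]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_features(X, y)
```

    In practice the F statistic would be converted to a p-value and only features below a chosen significance level kept before training the classifiers.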

  • Article type: Journal Article
    Kinase fusion genes are the most active group of fusion genes in human cancer. To help select clinically significant kinases so that cancer patients with fusion genes can be better diagnosed, we need a metric for assessing kinases in pan-cancer fusion genes that does not rely on the sample frequency of expressed fusion genes. Moreover, multiple studies have assessed human kinases as drug targets using multiple types of genomic and clinical information, but none used kinase fusion genes. Assessments of kinases that ignore kinase fusion gene events can miss one of the mechanisms that enhance kinase function in cancer. To fill this gap, we propose a novel way of assessing genes that uses a network propagation approach to infer how likely individual kinases are to influence the kinase fusion gene network, which is composed of ~5K kinase fusion gene pairs. To select better seeds for propagation, we chose the top genes via dimensionality reduction, such as a principal component or latent-layer representation of six features of individual genes in pan-cancer fusion genes. Our approach may provide a novel way to assess human kinases in cancer.
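    As a rough illustration of network propagation from seed genes, the sketch below runs a random walk with restart on a small undirected graph. The abstract does not say which propagation algorithm was used, so the choice of random walk with restart, the restart probability, and the node names are assumptions.

```python
def propagate(adj, seeds, restart=0.5, iters=50):
    """Random walk with restart on an undirected graph given as adjacency lists."""
    nodes = list(adj)
    # Restart distribution: uniform over the seed nodes.
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(iters):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            if not adj[u]:
                continue
            # Each node passes its non-restart mass equally to its neighbours.
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p

# Hypothetical fusion-partner network: a simple path A - B - C - D, seeded at A.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = propagate(adj, seeds={"A"})
```

    Nodes closer to the seed receive higher scores, which is the property that lets propagated scores serve as an influence ranking.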

  • Article type: Journal Article
    While previous studies have suggested a close association between students' psychological variables and their higher-order cognitive abilities, such studies have largely been lacking for third world countries like India, with their unique socio-economic-cultural set of challenges. We aimed to investigate the relationship between psychological variables (depression, anxiety and stress) and cognitive functions among Indian students, and to predict cognitive performance as a function of these variables.
    Four hundred and thirteen university students were systematically selected using purposive sampling. Widely used and validated offline questionnaires were used to assess their psychological and cognitive statuses. Correlational analyses were conducted to examine the associations between these variables. An Artificial Neural Network (ANN) model was applied to predict cognitive levels based on the scores of the psychological variables.
    Correlational analyses revealed negative correlations between emotional distress and cognitive functioning. Principal Component Analysis (PCA) reduced the dimensionality of the input data, effectively capturing the variance with fewer features. The feature weight analysis indicated a balanced contribution of each mental health symptom, with particular emphasis on one of the symptoms. The ANN model demonstrated moderate predictive performance, explaining a portion of the variance in cognitive levels based on the psychological variables.
    The study confirms significant associations between the emotional status of university students and their cognitive abilities. Specifically, we provide evidence for the first time that, in Indian students, self-reported higher levels of stress, anxiety, and depression are linked to lower performance in cognitive tests. The application of PCA and feature weight analysis provided deeper insights into the structure of the predictive model. Notably, use of the ANN model provided insights into predicting these cognitive domains as a function of the emotional attributes. Our results emphasize the importance of addressing mental health concerns and implementing interventions to enhance cognitive functions in university students.

  • Article type: Journal Article
    In the field of audiology, achieving accurate discrimination of auditory impairments remains a formidable challenge. Conditions such as deafness and tinnitus exert a substantial impact on patients' overall quality of life, emphasizing the urgent need for precise and efficient classification methods. This study introduces an innovative approach, utilizing Multi-View Brain Network data acquired from three distinct cohorts: 51 deaf patients, 54 with tinnitus, and 42 normal controls. Electroencephalogram (EEG) recording data were meticulously collected, focusing on 70 electrodes attached to an end-to-end key with 10 regions of interest (ROI). These data are integrated with machine learning algorithms. To tackle the inherently high-dimensional nature of brain connectivity data, principal component analysis (PCA) is employed for feature reduction, enhancing interpretability. The proposed approach is evaluated using ensemble learning techniques, including Random Forest, Extra Trees, Gradient Boosting, and CatBoost. The performance of the proposed models is scrutinized across a comprehensive set of metrics, encompassing cross-validation accuracy (CVA), precision, recall, F1-score, Kappa, and Matthews correlation coefficient (MCC). The proposed models demonstrate statistical significance and effectively diagnose auditory disorders, contributing to early detection and personalized treatment, thereby enhancing patient outcomes and quality of life. Notably, they exhibit reliability and robustness, characterized by high Kappa and MCC values. This research represents a significant advancement at the intersection of audiology, neuroimaging, and machine learning, with transformative implications for clinical practice and care.

  • Article type: Journal Article
    Feature selection is a critical component of machine learning and data mining that addresses challenges such as irrelevance, noise, and redundancy in large-scale data, which often result in the curse of dimensionality. This study employs a K-nearest neighbour wrapper to implement feature selection using six nature-inspired algorithms, derived from human behaviour and mammal-inspired techniques. Evaluated on six real-world datasets, the study aims to compare the performance of these algorithms in terms of accuracy, feature count, fitness, convergence and computational cost. The findings underscore the efficacy of the Human Learning Optimization, Poor and Rich Optimization and Grey Wolf Optimizer algorithms across multiple performance metrics. For instance, for mean fitness, Human Learning Optimization outperforms the others, followed by Poor and Rich Optimization and Harmony Search. The study suggests the potential of human-inspired algorithms, particularly Poor and Rich Optimization, in robust feature selection without compromising classification accuracy.
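    A wrapper approach scores each candidate feature subset by training and evaluating a classifier on it. The sketch below shows one plausible fitness function that a metaheuristic would minimise: a weighted sum of the leave-one-out 1-nearest-neighbour error and the fraction of features kept. The weighting alpha, the use of 1-NN with leave-one-out, and the toy data are assumptions; the abstract does not state the exact fitness definition.

```python
def knn_loo_error(X, y, mask):
    """Leave-one-out 1-nearest-neighbour error using only features where mask is 1."""
    cols = [j for j, m in enumerate(mask) if m]
    if not cols:
        return 1.0  # an empty subset cannot classify anything
    errors = 0
    for i, xi in enumerate(X):
        best_d, best_y = None, None
        for k, xk in enumerate(X):
            if k == i:
                continue
            d = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if best_d is None or d < best_d:
                best_d, best_y = d, y[k]
        errors += best_y != y[i]
    return errors / len(X)

def fitness(X, y, mask, alpha=0.99):
    """Lower is better: classification error plus a small penalty per kept feature."""
    return alpha * knn_loo_error(X, y, mask) + (1 - alpha) * sum(mask) / len(mask)

# Hypothetical toy data: feature 0 is informative, feature 1 is large-scale noise.
X = [[0.0, 9.0], [0.1, 1.0], [0.2, 5.0], [1.0, 1.2], [1.1, 9.1], [1.2, 5.1]]
y = [0, 0, 0, 1, 1, 1]
masks = [[1, 0], [0, 1], [1, 1]]
best_mask = min(masks, key=lambda m: fitness(X, y, m))
```

    An optimizer such as those compared in the study would search the space of masks instead of enumerating it, which becomes necessary once the number of features grows.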

  • Article type: Journal Article
    Breast cancer is the second most common cancer among women worldwide, and diagnosis by pathologists is a time-consuming and subjective procedure. Computer-aided diagnosis frameworks are used to relieve pathologists' workload by classifying the data automatically, and deep convolutional neural networks (CNNs) are effective solutions for this. The features extracted from the activation layer of pre-trained CNNs are called deep convolutional activation features (DeCAF). In this paper, we show that using all DeCAF features does not necessarily lead to higher accuracy in the classification task and that dimension reduction plays an important role. We propose reduced DeCAF (R-DeCAF) for this purpose, and different dimension reduction methods are applied to achieve an effective combination of features by capturing the essence of the DeCAF features. This framework uses pre-trained CNNs such as AlexNet, VGG-16, and VGG-19 as feature extractors in transfer learning mode. The DeCAF features are extracted from the first fully connected layer of these CNNs, and a support vector machine is used for classification. Among linear and nonlinear dimensionality reduction algorithms, linear approaches such as principal component analysis (PCA) yield a better combination of deep features and lead to higher accuracy in the classification task using a small number of features, chosen according to a specific amount of cumulative explained variance (CEV). The proposed method is validated using the experimental BreakHis and ICIAR datasets. Comprehensive results show an improvement in classification accuracy of up to 4.3% with a feature vector size (FVS) of 23 and a CEV equal to 0.15.
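    Choosing the feature vector size from a cumulative explained variance target can be sketched as follows. This is a minimal PCA via singular value decomposition using NumPy, not the authors' pipeline: the function name, the random stand-in for DeCAF features, and the 0.5 target are assumptions for illustration.

```python
import numpy as np

def pca_reduce(X, cev_target):
    """Project X onto the fewest principal components whose CEV reaches cev_target."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data: rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    cev = np.cumsum(explained)
    # First index at which the cumulative share reaches the target, as a count.
    k = min(int(np.searchsorted(cev, cev_target)) + 1, len(s))
    return Xc @ Vt[:k].T, cev[:k]

# Hypothetical stand-in for deep features: 50 samples, 10 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X[:, 0] *= 10.0  # make one direction dominate the variance
Z, cev = pca_reduce(X, cev_target=0.5)
```

    The reduced matrix Z would then be passed to the classifier in place of the full feature matrix.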

  • Article type: Journal Article
    Magnetic resonance imaging (MRI) has long been used to detect brain diseases, and many useful techniques have been developed for this task. However, there is still potential to improve the classification of brain diseases so that the results can be trusted. In this research we present, for the first time, a non-linear feature extraction method for MRI sub-images obtained from three levels of the two-dimensional dual-tree complex wavelet transform (2D DT-CWT), in order to classify multiple brain diseases. After extracting the non-linear features from the sub-images, we used the spectral regression discriminant analysis (SRDA) algorithm to reduce the classification features. Instead of using deep neural networks, which are computationally expensive, we propose a hybrid RBF network that uses the k-means and recursive least squares (RLS) algorithms simultaneously in its structure for classification. To evaluate the performance of RBF networks with hybrid learning algorithms, we classify nine brain diseases based on MRI processing using these networks and compare the results with previously presented classifiers, including support vector machines (SVM) and K-nearest neighbour (KNN). Comprehensive comparisons are made with recently proposed approaches by extracting various types and numbers of features. Our aim in this paper is to reduce complexity and improve classification results with the hybrid RBF classifier; the results showed 100 percent classification accuracy both in two-class classification and in multi-class classification of brain diseases with 8 and 10 classes. In this paper, we provide a precise method with low computational cost for brain MRI disease classification. The results show that the proposed method is not only accurate but also computationally reasonable.

  • Article type: Journal Article
    BACKGROUND: The field of epigenomics holds great promise for understanding and treating disease, with advances in machine learning (ML) and artificial intelligence being vitally important in this pursuit. Increasingly, research utilises DNA methylation measures at cytosine-guanine dinucleotides (CpG) to detect disease and estimate biological traits such as aging. Given the challenge of the high dimensionality of DNA methylation data, feature-selection techniques are commonly employed to reduce dimensionality and identify the most important subset of features. In this study, our aim was to test and compare a range of feature-selection methods and ML algorithms in the development of a novel DNA methylation-based telomere length (TL) estimator. We utilised both nested cross-validation and two independent test sets for the comparisons.
    RESULTS: We found that principal component analysis in advance of elastic net regression led to the overall best performing estimator when evaluated using a nested cross-validation analysis and two independent test cohorts. This approach achieved a correlation between estimated and actual TL of 0.295 (83.4% CI [0.201, 0.384]) on the EXTEND test data set. In contrast, the baseline model of elastic net regression with no prior feature reduction stage performed less well in general, suggesting that a prior feature-selection stage may have important utility. A previously developed TL estimator, DNAmTL, achieved a correlation of 0.216 (83.4% CI [0.118, 0.310]) on the EXTEND data. Additionally, we observed that different DNA methylation-based TL estimators, which have few CpGs in common, are associated with many of the same biological entities.
    CONCLUSIONS: The variance in performance across the tested approaches shows that estimators are sensitive to data set heterogeneity, and the development of an optimal DNA methylation-based estimator should benefit from the robust methodological approach used in this study. Moreover, our methodology, which utilises a range of feature-selection approaches and ML algorithms, could be applied to other biological markers and disease phenotypes to examine their relationship with DNA methylation and their predictive value.
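    The structure of nested cross-validation can be sketched with index splits alone. The outer folds hold out data for evaluation, and the inner folds, drawn only from each outer training set, are where feature selection and hyperparameters would be tuned. The fold counts and function names below are placeholders; the abstract does not report them.

```python
def k_folds(indices, k):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def nested_cv(n, outer_k, inner_k):
    """Yield (outer_train, outer_test, inner_splits) for n samples."""
    for outer_train, outer_test in k_folds(list(range(n)), outer_k):
        # Inner splits never see the outer test indices.
        inner = list(k_folds(outer_train, inner_k))
        yield outer_train, outer_test, inner

# Hypothetical sizes: 12 samples, 3 outer folds, 2 inner folds.
splits = list(nested_cv(12, outer_k=3, inner_k=2))
```

    Keeping the dimensionality-reduction step inside the inner loop is what prevents information from the held-out samples leaking into the selected features.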

  • Article type: Journal Article
    DNA synthesis is widely used in synthetic biology to construct and assemble sequences ranging from short RBS to ultra-long synthetic genomes. Many sequence features, such as GC content and repeat sequences, are known to affect synthesis difficulty and, in turn, synthesis cost. In addition, there are latent sequence features, especially local characteristics of the sequence, that might affect the DNA synthesis process as well. Reliable prediction of the synthesis difficulty of a given sequence is important for reducing cost, but this remains a challenge. In this study, we propose a new automated machine learning (AutoML) approach to predict DNA synthesis difficulty, which achieves an F1 score of 0.930 and outperforms the current state-of-the-art model. We found local sequence features that were neglected in previous methods and that might also affect the difficulty of DNA synthesis. Moreover, experimental validation based on ten genes of Escherichia coli strain MG1655 shows that our model can achieve 80% accuracy, which is also better than the state of the art. In addition, we developed the cloud platform SCP4SSD using an entirely cloud-based serverless architecture for the convenience of end users.
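    To make the distinction between global and local sequence features concrete, the sketch below computes overall GC content, GC content in sliding windows, and the longest single-base run. These particular features, the window size, and the function names are illustrative assumptions; the abstract names only GC content and repeats and does not list the model's actual feature set.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def windowed_gc(seq, window, step):
    """GC content of each window: a simple local feature profile."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    best = run = 0
    prev = None
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best

# Hypothetical sequence: balanced overall, but locally extreme.
seq = "GGCCAATT"
features = {
    "gc": gc_content(seq),
    "gc_windows": windowed_gc(seq, window=4, step=4),
    "max_run": longest_homopolymer(seq),
}
```

    The example sequence has a global GC content of 0.5, yet its two windows are entirely GC and entirely AT, which is the kind of local pattern a global feature hides.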

  • Article type: Journal Article
    Disease phenotypes are characterized by signs (what a physician observes during the examination of a patient) and symptoms (the complaints of a patient to a physician). Large repositories of disease phenotypes are accessible through the Online Mendelian Inheritance in Man, Human Phenotype Ontology, and Orphadata initiatives. Many of the diseases in these datasets are neurologic. In each repository, the phenotype of a neurologic disease is represented as a variable-length list of concepts selected from a restricted ontology. Visualizations of these concept lists are not provided. We address this limitation by using subsumption to reduce the number of descriptive features from 2,946 classes to thirty superclasses. Phenotype feature lists of variable length were converted into fixed-length vectors. Phenotype vectors were aggregated into matrices and visualized as heat maps that allowed side-by-side disease comparisons. Individual diseases (each representing a row in the matrix) were visualized as word clouds. We illustrate the utility of this approach by visualizing the neuro-phenotypes of 32 dystonic diseases from Orphadata. Subsumption can collapse phenotype features into superclasses, phenotype lists can be vectorized, and phenotype vectors can be visualized as heat maps and word clouds.
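    The subsumption and vectorization steps can be sketched as walking each concept up an is-a hierarchy to a designated superclass, then counting concepts per superclass. The tiny ontology below is invented for illustration, and the single-parent hierarchy is a simplifying assumption, since real ontologies allow multiple parents.

```python
def to_superclass(concept, parent, superclasses):
    """Walk up the is-a hierarchy until a designated superclass is reached."""
    seen = set()
    while concept is not None and concept not in seen:
        if concept in superclasses:
            return concept
        seen.add(concept)
        concept = parent.get(concept)
    return None  # concept is not subsumed by any superclass

def vectorize(concepts, parent, superclasses):
    """Turn a variable-length concept list into a fixed-length count vector."""
    order = sorted(superclasses)
    counts = dict.fromkeys(order, 0)
    for c in concepts:
        s = to_superclass(c, parent, superclasses)
        if s is not None:
            counts[s] += 1
    return [counts[s] for s in order]

# Hypothetical ontology fragment (child -> parent) and superclass set.
parent = {
    "writer's cramp": "dystonia",
    "dystonia": "movement disorder",
    "tremor": "movement disorder",
    "dysarthria": "speech disorder",
}
superclasses = {"movement disorder", "speech disorder"}
vector = vectorize(["writer's cramp", "tremor", "dysarthria"], parent, superclasses)
```

    Stacking one such vector per disease gives the matrix that the heat maps display, with one column per superclass.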