tensor decomposition

  • Article type: Journal Article
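    PARAFAC2 enables tensor-based analysis of single-cell experiments across conditions. PARAFAC2 separates condition-specific effects from cell-to-cell variation. PARAFAC2 intuitively isolates patterns into condition-, cell-, and gene-specific patterns.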
    Effective tools for exploration and analysis are needed to extract insights from large-scale single-cell measurement data. However, current techniques for handling single-cell studies performed across experimental conditions (e.g., samples, perturbations, or patients) require restrictive assumptions, lack flexibility, or do not adequately deconvolute condition-to-condition variation from cell-to-cell variation. Here, we report that the tensor decomposition method PARAFAC2 (Pf2) enables the dimensionality reduction of single-cell data across conditions. We demonstrate these benefits across two distinct contexts of single-cell RNA-sequencing (scRNA-seq) experiments of peripheral immune cells: pharmacologic drug perturbations and systemic lupus erythematosus (SLE) patient samples. By isolating relevant gene modules across cells and conditions, Pf2 enables straightforward associations of gene variation patterns across specific patients or perturbations while connecting each coordinated change to certain cells without pre-defining cell types. The theoretical grounding of Pf2 suggests a unified framework for many modeling tasks associated with single-cell data. Thus, Pf2 provides an intuitive universal dimensionality reduction approach for multi-sample single-cell studies across diverse biological contexts.
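
    As a rough illustration of the model named in this abstract, the standard PARAFAC2 form can be written as below. The notation is an assumption chosen for this sketch (it is not taken from the paper): X_k is the cells-by-genes matrix for condition k, R is the number of components, A holds gene loadings, c_k is the k-th row of a conditions-by-components matrix C, and P_k B gives the cell-level projections for condition k.

```latex
% Assumed notation: X_k \in \mathbb{R}^{n_k \times G}, where the cell count n_k may differ per condition
X_k \approx P_k \, B \, \operatorname{diag}(c_k) \, A^{\top},
\qquad P_k^{\top} P_k = I_R,
\qquad k = 1, \dots, K
% P_k \in \mathbb{R}^{n_k \times R} has orthonormal columns, B \in \mathbb{R}^{R \times R},
% A \in \mathbb{R}^{G \times R}, and c_k \in \mathbb{R}^{R}.
% Consequence of the constraint: (P_k B)^{\top} (P_k B) = B^{\top} B is the same for every k.
```

    Because each condition keeps its own P_k, conditions may contain different numbers of cells, while the gene and condition factors are shared across all of them.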

  • Article type: Journal Article
    In the past decade, tensors have become increasingly attractive in different areas of signal and image processing. The main reason is the inefficiency of matrices in representing and analyzing multimodal and multidimensional datasets. Matrices cannot preserve the multidimensional correlation of elements in higher-order datasets, and this greatly reduces the effectiveness of matrix-based approaches in analyzing multidimensional datasets. In addition, tensor-based approaches have demonstrated promising performance. Together, these factors have encouraged researchers to move from matrices to tensors. Among different signal and image processing applications, analyzing biomedical signals and images is of particular importance. This is due to the need to extract accurate information from biomedical datasets, which directly affect patients' health. In addition, in many cases, several datasets are recorded simultaneously from a patient. A common example is recording electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) of a patient with schizophrenia. In such a situation, tensors seem to be among the most effective methods for the simultaneous exploitation of two (or more) datasets. Therefore, several tensor-based methods have been developed for analyzing biomedical datasets. Considering this, in this paper we aim to provide a comprehensive review of tensor-based methods in biomedical image analysis. The presented study and classification of different methods and applications can show the importance of tensors in biomedical image enhancement and open new avenues for future studies.

  • Article type: Journal Article
    Electrophysiologic disturbances due to neurodegenerative disorders such as Alzheimer's disease and Lewy Body disease are detectable by scalp EEG and can serve as a functional measure of disease severity. Traditional quantitative methods of EEG analysis often require an a-priori selection of clinically meaningful EEG features and are susceptible to bias, limiting the clinical utility of routine EEGs in the diagnosis and management of neurodegenerative disorders. We present a data-driven tensor decomposition approach to extract the top 6 spectral and spatial features representing commonly known sources of EEG activity during eyes-closed wakefulness. As part of their neurologic evaluation at Mayo Clinic, 11 001 patients underwent 12 176 routine, standard 10-20 scalp EEG studies. From these raw EEGs, we developed an algorithm based on posterior alpha activity and eye movement to automatically select awake-eyes-closed epochs and estimated average spectral power density (SPD) between 1 and 45 Hz for each channel. We then created a three-dimensional (3D) tensor (record × channel × frequency) and applied a canonical polyadic decomposition to extract the top six factors. We further identified an independent cohort of patients meeting consensus criteria for mild cognitive impairment (30) or dementia (39) due to Alzheimer's disease and dementia with Lewy Bodies (31) and similarly aged cognitively normal controls (36). We evaluated the ability of the six factors in differentiating these subgroups using a Naïve Bayes classification approach and assessed for linear associations between factor loadings and Kokmen short test of mental status scores, fluorodeoxyglucose (FDG) PET uptake ratios and CSF Alzheimer's Disease biomarker measures. Factors represented biologically meaningful brain activities including posterior alpha rhythm, anterior delta/theta rhythms and centroparietal beta, which correlated with patient age and EEG dysrhythmia grade. These factors were also able to distinguish patients from controls with a moderate to high degree of accuracy (Area Under the Curve (AUC) 0.59-0.91) and Alzheimer's disease dementia from dementia with Lewy Bodies (AUC 0.61). Furthermore, relevant EEG features correlated with cognitive test performance, PET metabolism and CSF AB42 measures in the Alzheimer's subgroup. This study demonstrates that data-driven approaches can extract biologically meaningful features from population-level clinical EEGs without artefact rejection or a-priori selection of channels or frequency bands. With continued development, such data-driven methods may improve the clinical utility of EEG in memory care by assisting in early identification of mild cognitive impairment and differentiating between different neurodegenerative causes of cognitive impairment.
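
    The canonical polyadic (CP) decomposition of a three-way tensor mentioned above can be sketched with plain alternating least squares. This is a minimal illustration, not the authors' pipeline: the function names (`khatri_rao`, `cp_reconstruct`, `cp_als`), the random initialisation, and the fixed iteration count are all assumptions made for this example, and the abstract does not say which fitting algorithm was used.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (I x R) and (J x R) -> (I*J x R)."""
    R = U.shape[1]
    return np.einsum("ir,jr->ijr", U, V).reshape(-1, R)

def cp_reconstruct(A, B, C):
    """Rebuild the tensor as a sum of R rank-1 terms a_r o b_r o c_r."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def cp_als(X, rank, n_iter=50, seed=0):
    """Hypothetical helper: fit a rank-`rank` CP model by alternating least squares.

    For an EEG-style tensor the three modes would be
    record x channel x frequency, giving factor matrices A, B, C.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Unfoldings whose column order matches khatri_rao above.
    X0 = X.reshape(I, J * K)
    X1 = np.moveaxis(X, 1, 0).reshape(J, I * K)
    X2 = np.moveaxis(X, 2, 0).reshape(K, I * J)
    errors = []
    for _ in range(n_iter):
        # Each update is an exact least-squares solve with the other two factors fixed.
        A = X0 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = X1 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = X2 @ np.linalg.pinv(khatri_rao(A, B)).T
        residual = X - cp_reconstruct(A, B, C)
        errors.append(np.linalg.norm(residual) / np.linalg.norm(X))
    return A, B, C, errors
```

    With `rank=6`, the columns of the three returned matrices would play the role of per-record loadings, spatial (channel) patterns, and spectral (frequency) patterns. CP factors are only defined up to scaling and permutation, so any ordering such as "top six" needs an extra convention (for example sorting by component norm) that this sketch does not implement.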

  • Article type: Journal Article
    As an effective strategy for reducing the noisy and redundant information in hyperspectral imagery (HSI), hyperspectral band selection aims to select a subset of the original hyperspectral bands, which benefits subsequent tasks. In this paper, we introduce a multi-dimensional high-order structure preserved clustering method for hyperspectral band selection, referred to as MHSPC for short. By regarding original hyperspectral images as a tensor cube, we apply the tensor CP (CANDECOMP/PARAFAC) decomposition to it to exploit the multi-dimensional structural information as well as generate a low-dimensional latent feature representation. In order to capture the local geometrical structure along the spectral dimension, a graph regularizer is imposed on the new feature representation in the lower-dimensional space. In addition, since the low-rankness of HSIs is an important global property, we utilize a nuclear norm constraint on the latent feature representation matrix to capture the global data structure information. Unlike most previous clustering-based hyperspectral band selection methods, which vectorize each band without considering the 2-D spatial information, the proposed MHSPC can effectively capture the spatial structure as well as the spectral correlation of the original hyperspectral cube from both local and global perspectives. An efficient alternating update algorithm with a theoretical convergence guarantee is designed to solve the resultant optimization problem, and extensive experimental results on four benchmark datasets validate the effectiveness of the proposed MHSPC over other state-of-the-art methods.
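
    One generic way to combine the three ingredients named in this abstract (a CP fit, a graph regularizer, and a nuclear norm term) is sketched below. This is an illustrative objective only and should not be read as the paper's formulation: the choice of C as the spectral-mode factor matrix, the Laplacian L built over bands, and the weights alpha and beta are all assumptions.

```latex
% Assumed notation: \mathcal{X} is the height x width x bands cube,
% [[A, B, C]] the CP model, C \in \mathbb{R}^{\text{bands} \times R} the spectral factor,
% L the Laplacian of a neighbourhood graph over bands.
\min_{A, B, C} \;
\bigl\| \mathcal{X} - [\![ A, B, C ]\!] \bigr\|_F^2
\; + \; \alpha \, \operatorname{tr}\!\bigl( C^{\top} L \, C \bigr)
\; + \; \beta \, \| C \|_{*}
% tr(C^T L C) = (1/2) \sum_{i,j} w_{ij} \| c_i - c_j \|_2^2 penalises differences between
% latent features of bands that are neighbours in the graph (local structure);
% \| C \|_{*} is the sum of singular values and encourages low rank (global structure).
```

    Under such a formulation, band selection would proceed by clustering the rows of C, which is consistent with the "clustering method" wording in the abstract but remains an assumption here.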

  • Article type: Journal Article
    OBJECTIVE: Parkinson disease (PD) is a common progressive neurodegenerative disorder in our ageing society. Early-stage PD biomarkers are desired for timely clinical intervention and understanding of pathophysiology. Since one of the characteristics of PD is the progressive loss of dopaminergic neurons in the substantia nigra pars compacta, we propose a feature extraction method for analysing the differences in the substantia nigra between PD and non-PD patients.
    METHODS: We propose a feature-extraction method for volumetric images based on a rank-1 tensor decomposition. Furthermore, we apply a feature selection method that excludes common features between PD and non-PD. We collect neuromelanin images of 263 patients: 124 PD and 139 non-PD patients and divide them into training and testing datasets for experiments. We then experimentally evaluate the classification accuracy of the substantia nigra between PD and non-PD patients using the proposed feature extraction method and linear discriminant analysis.
    RESULTS: The proposed method achieves a sensitivity of 0.72 and a specificity of 0.64 for our testing dataset of 66 non-PD and 42 PD patients. Furthermore, we visualise the important patterns in the substantia nigra by a linear combination of rank-1 tensors with selected features. The visualised patterns include the ventrolateral tier, where the severe loss of neurons can be observed in PD.
    CONCLUSIONS: We develop a new feature-extraction method for the analysis of the substantia nigra towards PD diagnosis. In the experiments, even though the classification accuracy with the proposed feature extraction method and linear discriminant analysis is lower than that of expert physicians, the results suggest the potential of tensorial feature extraction.
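
    The idea of using rank-1 tensors as features for a volumetric image can be illustrated as follows. This is a sketch under stated assumptions, not the authors' method: the helper names are hypothetical, the rank-1 components are taken as given rather than learned, and the feature is defined here simply as the inner product between the volume and each rank-1 tensor.

```python
import numpy as np

def rank1_tensor(a, b, c):
    """Outer product a o b o c, a 3D array of shape (len(a), len(b), len(c))."""
    return np.einsum("i,j,k->ijk", a, b, c)

def rank1_feature(volume, a, b, c):
    """Inner product <volume, a o b o c>, computed without forming the rank-1 tensor."""
    return float(np.einsum("ijk,i,j,k->", volume, a, b, c))

def extract_features(volume, components):
    """Hypothetical helper: one scalar feature per (a, b, c) triple."""
    return np.array([rank1_feature(volume, a, b, c) for a, b, c in components])

def visualise_pattern(weights, components):
    """Weighted sum of rank-1 tensors, e.g. with weights from a linear classifier."""
    return sum(w * rank1_tensor(a, b, c) for w, (a, b, c) in zip(weights, components))
```

    The resulting feature vectors could then be passed to a linear classifier such as linear discriminant analysis, and `visualise_pattern` mirrors the abstract's description of displaying a linear combination of rank-1 tensors built from the selected features.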

  • Article type: Journal Article
    Executive functions are essential for adaptive behavior. One executive function is the so-called 'interference control' or conflict monitoring; another is inhibitory control (i.e., action restraint and action cancelation). Recent evidence suggests an interplay of these processes, which is conceptually relevant given that newer conceptual frameworks imply that nominally different action/response control processes are explainable by a small set of cognitive and neurophysiological processes. The existence of such overarching neural principles has not yet been directly examined. In the current study, we therefore use EEG tensor decomposition methods to look into possible common neurophysiological signatures underlying conflict-modulated action restraint and action cancelation as mechanisms underlying response inhibition. We show how conflicts differentially modulate action restraint and action cancelation processes and delineate common and distinct neural processes underlying this interplay. Concerning the spatial information, modulations are similar in terms of the importance of processes reflected by parieto-occipital electrodes, suggesting that attentional selection processes play a role. Theta and alpha activity in particular seem to play important roles. The data also show that tensor decomposition is sensitive to the manner of task implementation, thereby suggesting that switch probability/transitional probabilities should be taken into consideration when choosing tensor decomposition as an analysis method. The study provides a blueprint of how to use tensor decomposition methods to delineate common and distinct neural mechanisms underlying action control functions using EEG data.

  • Article type: Journal Article
    Many complex chemical problems encoded in terms of physics-based models become computationally intractable for traditional numerical approaches due to their unfavorable scaling with increasing molecular size. Tensor decomposition techniques can overcome such challenges by decomposing unattainably large numerical representations of chemical problems into smaller, tractable ones. In the first two decades of this century, algorithms based on such tensor factorizations have become state-of-the-art methods in various branches of computational chemistry, ranging from molecular quantum dynamics to electronic structure theory and machine learning. Here, we consider the role that tensor decomposition schemes have played in expanding the scope of computational chemistry. We relate some of the most prominent methods to their common underlying tensor network formalisms, providing a unified perspective on leading tensor-based approaches in chemistry and materials science.

  • Article type: Journal Article
    BACKGROUND: Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. In addition, synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access it more rapidly with far fewer privacy constraints. Therefore, there has been a growing interest in establishing a method to generate synthetic data that protects patients' privacy while properly reflecting the data.
    OBJECTIVE: This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected.
    METHODS: We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients.
    RESULTS: The findings show that our model is a promising method for generating synthetic longitudinal health data that is similar enough to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data.
    CONCLUSIONS: We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.
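
    For readers unfamiliar with the generalized canonical polyadic (GCP) decomposition mentioned in the methods, its general form replaces the squared-error fit of ordinary CP with an elementwise loss chosen to suit the data type. The notation below is assumed for this sketch, and the abstract does not state which loss the authors used.

```latex
% Assumed notation: \mathcal{X} is the observed d-way tensor, \Omega the set of observed entries,
% \mathcal{M} = [[A_1, ..., A_d]] the low-rank CP model with factor matrices A_1, ..., A_d.
\min_{A_1, \dots, A_d} \; \sum_{i \in \Omega} f\bigl( x_i, m_i \bigr),
\qquad \mathcal{M} = [\![ A_1, \dots, A_d ]\!]
% Common choices of the elementwise loss f:
%   Gaussian (ordinary CP):       f(x, m) = (x - m)^2
%   Poisson (count data):         f(x, m) = m - x \log m,          m > 0
%   Bernoulli, logit link (0/1):  f(x, m) = \log(1 + e^{m}) - x m
```

    In a patients-by-features-by-time layout, one of the factor matrices has one row per patient; sampling new rows for that matrix, as the abstract describes, yields synthetic patients without copying any observed record directly.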

  • Article type: Journal Article
    In recent years, data-driven inference of cell-cell communication has helped reveal coordinated biological processes across cell types. Here, we integrate two tools, LIANA and Tensor-cell2cell, which, when combined, can deploy multiple existing methods and resources to enable the robust and flexible identification of cell-cell communication programs across multiple samples. In this work, we show how the integration of our tools facilitates the choice of method to infer cell-cell communication and subsequently perform an unsupervised deconvolution to obtain and summarize biological insights. We explain how to perform the analysis step by step in both Python and R and provide online tutorials with detailed instructions available at https://ccc-protocols.readthedocs.io/. This workflow typically takes ∼1.5 h to complete from installation to downstream visualizations on a graphics processing unit-enabled computer for a dataset of ∼63,000 cells, 10 cell types, and 12 samples.
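
    To make the data layout behind this workflow concrete, the sketch below stacks per-sample communication scores into a four-way array (samples × ligand-receptor pairs × sender cell types × receiver cell types), which is the kind of object a tensor decomposition would then factorize. It deliberately does not call LIANA or Tensor-cell2cell: the function name, the dictionary input format, and the zero fill for missing entries are all assumptions for illustration, and the linked tutorials remain the reference for the real API.

```python
import numpy as np

def build_communication_tensor(scores_by_sample, lr_pairs, cell_types):
    """Hypothetical helper: stack per-sample score dicts into a 4D array.

    scores_by_sample maps sample name -> {(lr_pair, sender, receiver): score}.
    Entries absent from a sample are left at 0.0 (an assumption of this sketch).
    """
    samples = sorted(scores_by_sample)
    lr_index = {pair: i for i, pair in enumerate(lr_pairs)}
    ct_index = {ct: i for i, ct in enumerate(cell_types)}
    n_ct = len(cell_types)
    tensor = np.zeros((len(samples), len(lr_pairs), n_ct, n_ct))
    for s, sample in enumerate(samples):
        for (pair, sender, receiver), score in scores_by_sample[sample].items():
            tensor[s, lr_index[pair], ct_index[sender], ct_index[receiver]] = score
    return samples, tensor
```

    Decomposing such an array gives, for each component, one loading vector per mode, so a single communication program can be read as a combination of samples, ligand-receptor pairs, senders, and receivers.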

  • Article type: Journal Article
    The accurate identification of disease-associated genes is crucial for understanding the molecular mechanisms underlying various diseases. Most current methods focus on constructing biological networks and utilizing machine learning, particularly deep learning, to identify disease genes. However, these methods overlook complex relations among entities in biological knowledge graphs. Such information has been successfully applied in other areas of life science research, demonstrating its effectiveness. Knowledge graph embedding methods can learn the semantic information of different relations within knowledge graphs. Nonetheless, the performance of existing representation learning techniques, when applied to domain-specific biological data, remains suboptimal. To solve these problems, we construct a biological knowledge graph centered on diseases and genes, and develop an end-to-end knowledge graph completion framework for disease gene prediction using interactional tensor decomposition, named KDGene. KDGene incorporates an interaction module that bridges entity and relation embeddings within tensor decomposition, aiming to improve the representation of semantically similar concepts in specific domains and enhance the ability to accurately predict disease genes. Experimental results show that KDGene significantly outperforms state-of-the-art algorithms, whether existing disease gene prediction methods or knowledge graph embedding methods for general domains. Moreover, the comprehensive biological analysis of the predicted results further validates KDGene's capability to accurately identify new candidate genes. This work proposes a scalable knowledge graph completion framework to identify disease candidate genes, and the results are expected to provide valuable references for further wet-lab experiments. Data and source codes are available at https://github.com/2020MEAI/KDGene.
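
    Knowledge graph completion by tensor decomposition generally means scoring a (head, relation, tail) triple from low-dimensional embeddings. The sketch below uses a plain trilinear (CP/DistMult-style) score as a stand-in; it does not reproduce KDGene's interaction module, which the abstract does not specify, and all function and variable names are hypothetical.

```python
import numpy as np

def trilinear_score(head, relation, tail):
    """Score of one triple: sum over dimensions of head * relation * tail."""
    return float(np.sum(head * relation * tail))

def rank_candidate_genes(disease_vec, relation_vec, gene_vecs):
    """Hypothetical helper: score every gene for (disease, relation, ?) and sort.

    gene_vecs has one row per gene; returns gene indices from best to worst
    together with the raw scores.
    """
    scores = gene_vecs @ (disease_vec * relation_vec)
    order = np.argsort(-scores)
    return order, scores
```

    In a trained model the embeddings would be learned from the known triples of the knowledge graph, and the highest-ranked genes not already linked to the disease would be the predicted candidates.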