gene expression prediction

  • Article type: Journal Article
    Understanding the regulatory mechanisms of gene expression is a crucial objective in genomics. Although the DNA sequence near the transcription start site (TSS) offers valuable insights, recent methods suggest that analyzing only the surrounding DNA may not suffice to accurately predict gene expression levels. We developed GENet (Gene Expression Network from Histone and Transcription Factor Integration), a novel approach that integrates essential regulatory signals from transcription factors and histone modifications into a graph-based model. GENet extends beyond simple DNA sequence analysis by incorporating additional layers of genetic control, which are vital for determining gene expression. Our method markedly improves the prediction of mRNA levels compared with previous models that depend solely on DNA sequence data. The results underscore the importance of including comprehensive regulatory information in gene expression studies. GENet emerges as a promising tool for researchers, with potential applications ranging from fundamental biological research to the development of medical therapies.

  • Article type: Journal Article
    Chromatin interactions create spatial proximity between distal regulatory elements and target genes in the genome, which has an important impact on gene expression, transcriptional regulation, and phenotypic traits. To date, several methods have been developed for predicting gene expression. However, existing methods do not take into consideration the effect of chromatin interactions on target gene expression, thus potentially reducing the accuracy of gene expression prediction and of mining for important regulatory elements. In this study, we developed a highly accurate deep learning-based gene expression prediction model (DeepCBA) based on maize chromatin interaction data. Compared with existing models, DeepCBA exhibits higher accuracy in expression classification and expression value prediction. The average Pearson correlation coefficients (PCCs) for predicting gene expression using gene promoter proximal interactions, proximal-distal interactions, and both proximal and distal interactions were 0.818, 0.625, and 0.929, respectively, representing increases of 0.357, 0.16, and 0.469 over the PCCs obtained with traditional methods that use only gene proximal sequences. Some important motifs were identified through DeepCBA; they were enriched in open chromatin regions and expression quantitative trait loci (eQTLs) and showed clear tissue specificity. Importantly, experimental results for the maize flowering-related gene ZmRap2.7 and the tillering-related gene ZmTb1 demonstrated the feasibility of DeepCBA for exploring regulatory elements that affect gene expression. Moreover, promoter editing and verification of two reported genes (ZmCLE7 and ZmVTE4) demonstrated the utility of DeepCBA for the precise design of gene expression and even for future intelligent breeding. DeepCBA is available at http://www.deepcba.com/ or http://124.220.197.196/.
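    The PCC figures above can be made concrete with a minimal sketch. It shows how a Pearson correlation between predicted and measured expression could be computed, and it derives the baseline PCCs implied by the reported values and gains. The function name `pearson` and the dictionary keys are illustrative assumptions, not part of DeepCBA.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# (reported PCC, reported gain over the proximal-sequence-only baseline),
# taken from the abstract; the key names are hypothetical labels.
reported = {
    "proximal": (0.818, 0.357),
    "proximal_distal": (0.625, 0.16),
    "both": (0.929, 0.469),
}

# Baseline PCC implied by each pair: reported value minus reported gain.
implied_baseline = {
    name: round(pcc - gain, 3) for name, (pcc, gain) in reported.items()
}
```

    Under these numbers the implied baselines all fall between 0.46 and 0.47, which is consistent with the three gains being measured against similar sequence-only baselines.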

  • Article type: Preprint
    The evolution of gene expression responses is a critical component of adaptation to variable environments. Predicting how DNA sequence influences expression is challenging because the genotype-to-phenotype map is not well resolved for cis-regulatory elements, transcription factor binding, regulatory interactions, and epigenetic features, not to mention how these factors respond to the environment. We tested whether flexible machine learning models could learn some of the underlying cis-regulatory genotype-to-phenotype map. We tested this approach using cold-responsive transcriptome profiles in 5 diverse Arabidopsis thaliana accessions. We first tested for evidence that cis regulation plays a role in environmental response, finding 14 and 15 motifs that were significantly enriched within the upstream and downstream regions, respectively, of cold-responsive differentially regulated genes (DEGs). We next applied convolutional neural networks (CNNs), which learn de novo cis-regulatory motifs in DNA sequences, to predict expression response to the environment. We found that CNNs predicted differential expression with moderate accuracy, with evidence that predictions were hindered by the biological complexity of regulation and the large potential regulatory code. Overall, DEGs between specific environments can be predicted based on variation in cis-regulatory sequences, although more information needs to be incorporated and better models may be required.
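    As a rough illustration of how a CNN filter can act as a motif detector, the sketch below one-hot encodes a DNA string and slides a single filter along it. The motif `ACGTG`, the example sequence, and the function names are arbitrary assumptions chosen for illustration; they are not taken from the preprint, and a trained CNN would learn its filter weights rather than have them set by hand.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T); other symbols become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_scan(encoded, kernel):
    """Slide one filter along the encoded sequence (no padding) and return a score per position."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(score)
    return scores

# Hand-set filter: the one-hot encoding of a hypothetical motif, so the
# score at each position equals the number of bases matching the motif.
motif_filter = one_hot("ACGTG")

scores = conv1d_scan(one_hot("TTACGTGAA"), motif_filter)
best_position = scores.index(max(scores))  # global max pooling over positions
```

    In this toy case the highest score marks the position where the sequence matches the motif exactly; a real model stacks many such learned filters and feeds the pooled scores into later layers.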

  • Article type: Journal Article
    Spatial transcriptomics (ST) provides insights into the tumor microenvironment (TME), which is closely associated with cancer prognosis, but ST has limited clinical availability. In this study, we provide a powerful deep learning system to augment TME information based on histological images for patients without ST data, thereby enabling precise cancer prognosis. The system provides two connections to bridge existing gaps. The first is the integrated graph and image deep learning (IGI-DL) model, which predicts ST expression based on histological images with a 0.171 increase in mean correlation across three cancer types compared with five existing methods. The second connection is the cancer prognosis prediction model, based on the TME depicted by spatial gene expression. Our survival model, using graphs with predicted ST features, achieves superior accuracy, with concordance indices of 0.747 and 0.725 for The Cancer Genome Atlas breast cancer and colorectal cancer cohorts, respectively, outperforming other survival models. For the external Molecular and Cellular Oncology colorectal cancer cohort, our survival model maintains a stable advantage.
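    The concordance index quoted above measures how often a survival model ranks pairs of patients correctly. The following is a minimal sketch of Harrell's C-index under the assumption that a higher risk score means shorter expected survival; the function name, argument names, and toy data are illustrative and are not the evaluation code used in the study.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable patient pairs ranked correctly.

    times  -- observed follow-up time per patient
    events -- 1 if the event was observed, 0 if the patient was censored
    risks  -- predicted risk score (assumed: higher = shorter survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: the third patient is censored at time 6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
perfect = concordance_index(toy_times, toy_events, [0.9, 0.7, 0.4, 0.1])
inverted = concordance_index(toy_times, toy_events, [0.1, 0.4, 0.7, 0.9])
```

    A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported 0.747 and 0.725.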

  • Article type: Journal Article
    Spatial transcriptomics (ST), which captures gene expression at fine-grained spatial locations (i.e., different windows) within tissue samples, has become vital in developing innovative treatments. Traditional ST technology, however, relies on costly specialized commercial equipment. To address this, our article aims to create a cost-effective, virtual ST approach that uses standard tissue images for gene expression prediction, eliminating the need for expensive equipment. Conventional approaches in this field often overlook the long-distance spatial dependencies between different sample windows or need prior gene expression data. To overcome these limitations, we propose the Edge-Relational Window-Attentional Network (ErwaNet), which enhances gene prediction by capturing both local interactions and global structural information from tissue images, without prior gene expression data. ErwaNet constructs heterogeneous graphs to model local window interactions and incorporates an attention mechanism for global information analysis. This dual framework not only provides a cost-effective solution for gene expression prediction but also removes the need for prior gene expression information, a significant advantage in cancer research, where it enables a more efficient and accessible analytical paradigm. ErwaNet stands out as a prior-free and easy-to-implement Graph Convolution Network (GCN) method for predicting gene expression from tissue images. Evaluation on two public breast cancer datasets shows that ErwaNet, without additional information, outperforms state-of-the-art (SOTA) methods. Code is available at https://github.com/biyecc/ErwaNet.

  • Article type: Journal Article
    Most regulatory elements, especially enhancer sequences, are cell population-specific. One could even argue that a distinct set of regulatory elements is what defines a cell population. However, discovering which non-coding regions of the DNA are essential in which context, and as a result, which genes are expressed, is a difficult task. Some computational models tackle this problem by predicting gene expression directly from the genomic sequence. These models are currently limited to predicting bulk measurements and mainly make tissue-specific predictions. Here, we present a model that leverages single-cell RNA-sequencing data to predict gene expression. We show that cell population-specific models outperform tissue-specific models, especially when the expression profiles of a cell population and the corresponding tissue are dissimilar. Further, we show that our model can prioritize GWAS variants and learn motifs of transcription factor binding sites. We envision that our model can be useful for delineating cell population-specific regulatory elements.

  • Article type: Journal Article
    Imaging-based spatial transcriptomics techniques provide valuable spatial and gene expression information at single-cell resolution. However, their current capability is restricted to profiling a limited number of genes per sample, leaving most of the transcriptome unmeasured. To overcome this challenge, we develop ENGEP, an ensemble learning-based tool that predicts unmeasured gene expression in spatial transcriptomics data by using multiple single-cell RNA sequencing datasets as references. ENGEP outperforms current state-of-the-art tools and brings biological insight by accurately predicting unmeasured genes. ENGEP is highly efficient in terms of runtime and memory usage, making it scalable for analyzing large datasets.

  • Article type: Journal Article
    Human biology is rooted in highly specialized cell types programmed by a common genome, 98% of which is outside of genes. Genetic variation in the enormous noncoding space is linked to the majority of disease risk. To address the problem of linking these variants to expression changes in primary human cells, we introduce ExPectoSC, an atlas of modular deep-learning-based models for predicting cell-type-specific gene expression directly from sequence. We provide models for 105 primary human cell types covering 7 organ systems, demonstrate their accuracy, and then apply them to prioritize relevant cell types for complex human diseases. The resulting atlas of sequence-based gene expression and variant effects is publicly available in a user-friendly interface and readily extensible to any primary cell type. We demonstrate the accuracy of our approach through systematic evaluations and apply the models to prioritize ClinVar clinical variants of uncertain significance, verifying our top predictions experimentally.

  • Article type: Journal Article
    Gene expression can be used to subtype breast cancer with improved prediction of risk of recurrence and treatment responsiveness over that obtained using routine immunohistochemistry (IHC). However, in the clinic, molecular profiling is primarily used for ER+ breast cancer; it is costly and tissue-destructive, requires specialised platforms, and takes several weeks to return a result. Deep learning algorithms can effectively extract morphological patterns in digital histopathology images to predict molecular phenotypes quickly and cost-effectively. We propose a new, computationally efficient approach called hist2RNA, inspired by bulk RNA sequencing techniques, to predict the expression of 138 genes (incorporated from 6 commercially available molecular profiling tests), including luminal PAM50 subtype, from hematoxylin and eosin (H&E)-stained whole slide images (WSIs). The training phase involves the aggregation of extracted features for each patient from a pretrained model to predict gene expression at the patient level using annotated H&E images from The Cancer Genome Atlas (TCGA, n = 335). We demonstrate successful gene prediction on a held-out test set (n = 160, corr = 0.82 across patients, corr = 0.29 across genes) and perform exploratory analysis on an external tissue microarray (TMA) dataset (n = 498) with known IHC and survival information. Our model is able to predict gene expression and luminal PAM50 subtype (Luminal A versus Luminal B) on the TMA dataset with prognostic significance for overall survival in univariate analysis (c-index = 0.56, hazard ratio = 2.16 (95% CI 1.12-3.06), p < 5 × 10⁻³), and independent significance in multivariate analysis incorporating standard clinicopathological variables (c-index = 0.65, hazard ratio = 1.87 (95% CI 1.30-2.68), p < 5 × 10⁻³). The proposed strategy achieves superior performance while requiring less training time, resulting in less energy consumption and computational cost compared to patch-based models. Additionally, hist2RNA predicts gene expression that has the potential to determine luminal molecular subtypes, which correlate with overall survival, without the need for expensive molecular testing.
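    The patient-level aggregation step described above can be sketched as follows. The abstract does not say which aggregation hist2RNA uses, so mean pooling is assumed here purely for illustration; the function name, patient IDs, and feature values are hypothetical.

```python
def aggregate_by_patient(tile_features):
    """Average tile-level feature vectors into one vector per patient.

    tile_features -- iterable of (patient_id, feature_vector) pairs, where each
                     vector would come from a pretrained image model.
    Mean pooling is an assumption; the source does not specify the method.
    """
    sums = {}
    counts = {}
    for patient_id, vec in tile_features:
        if patient_id not in sums:
            sums[patient_id] = [0.0] * len(vec)
            counts[patient_id] = 0
        elif len(vec) != len(sums[patient_id]):
            raise ValueError("feature vectors must share one length per patient")
        for k, value in enumerate(vec):
            sums[patient_id][k] += value
        counts[patient_id] += 1
    return {
        pid: [total / counts[pid] for total in vec_sum]
        for pid, vec_sum in sums.items()
    }

# Toy input: two tiles for patient "P1", one tile for patient "P2".
tiles = [("P1", [1.0, 2.0]), ("P1", [3.0, 4.0]), ("P2", [5.0, 6.0])]
patient_vectors = aggregate_by_patient(tiles)
```

    Collapsing tiles to one vector per patient means the regression model sees a few hundred inputs rather than millions of patches, which is one plausible reading of the reduced training cost reported relative to patch-based models.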

  • Article type: Journal Article
    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states, and mRNA and protein levels. Deep neural networks automatically learn informative sequence representations, and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then on specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.