gene ontology

  • Article type: Journal Article
    BACKGROUND: Integration of data from multiple domains can greatly enhance the quality and applicability of knowledge generated in analysis workflows. However, working with health data is challenging, requiring careful preparation in order to support meaningful interpretation and robust results. Ontologies encapsulate relationships between variables that can enrich the semantic content of health datasets to enhance interpretability and inform downstream analyses.
    RESULTS: We developed an R package for electronic health data preparation, "eHDPrep," demonstrated upon a multimodal colorectal cancer dataset (661 patients, 155 variables; Colo-661); a further demonstrator is taken from The Cancer Genome Atlas (459 patients, 94 variables; TCGA-COAD). eHDPrep offers user-friendly methods for quality control, including internal consistency checking and redundancy removal with information-theoretic variable merging. Semantic enrichment functionality is provided, enabling generation of new informative "meta-variables" according to ontological common ancestry between variables, demonstrated with SNOMED CT and the Gene Ontology in the current study. eHDPrep also facilitates numerical encoding, variable extraction from free text, completeness analysis, and user review of modifications to the dataset.
    CONCLUSIONS: eHDPrep provides effective tools to assess and enhance data quality, laying the foundation for robust performance and interpretability in downstream analyses. Application to multimodal colorectal cancer datasets resulted in improved data quality, structuring, and robust encoding, as well as enhanced semantic information. We make eHDPrep available as an R package from CRAN (https://cran.r-project.org/package=eHDPrep) and GitHub (https://github.com/overton-group/eHDPrep).

  • Article type: Journal Article
    Genome-wide transcriptome analysis is a method that produces important data on plant biology at a systemic level. The lack of understanding of the relationships between proteins and genes in plants necessitates a further thorough analysis at the proteogenomic level. Recently, our group generated a quantitative proteogenomic atlas of 15 sweet cherry (Prunus avium L.) cv. 'Tragana Edessis' tissues represented by 29,247 genes and 7,584 proteins. The aim of the current study was to perform a targeted analysis at the gene/protein level to assess the structure of their relation and its biological implications. Weighted correlation network analysis and causal modeling were employed to, respectively, cluster the gene/protein pairs and reveal their cause-effect relations, aiming to assess the associated biological functions. To the best of our knowledge, this is the first time that causal modeling has been employed within the proteogenomics concept in plants. The analysis revealed the complex nature of causal relations among genes/proteins that are important for traits of interest in perennial fruit trees, particularly regarding the fruit softening and ripening process in sweet cherry. Causal discovery could be used to highlight persistent relations at the gene/protein level, stimulating biological interpretation and facilitating further study of the proteogenomic atlas in plants.

  • Article type: Journal Article
    Alzheimer's disease (AD) is the commonest progressive neurodegenerative condition in humans, and is currently incurable. A wide spectrum of comorbidities, including other neurodegenerative diseases, is frequently associated with AD. How AD interacts with those comorbidities can be examined by analysing gene expression patterns in affected tissues using bioinformatics tools. We surveyed public data repositories for available gene expression data on tissue from AD subjects and from people affected by neurodegenerative diseases that are often found as comorbidities with AD. We then utilized a large set of gene expression data, cell-related data and other public resources through an analytical process to identify functional disease links. This process incorporated gene set enrichment analysis and utilized semantic similarity to give proximity measures. We identified genes with abnormal expression that were common to AD and its comorbidities, as well as shared gene ontology terms and molecular pathways. Our methodological pipeline was implemented in the R platform as an open-source package and is available at the following link: https://github.com/unchowdhury/AD_comorbidity. The pipeline was thus able to identify factors and pathways that may constitute functional links between AD and these common comorbidities, by which they affect each other's development and progression. This pipeline can also be useful to identify key pathological factors and therapeutic targets for other diseases and disease interactions.

  • Article type: Journal Article
    This study describes two complementary methods that use network-based and sequence similarity tools to identify drug repurposing opportunities predicted to modulate viral proteins. This approach could be rapidly adapted to new and emerging viruses. The first method built and studied a virus-host-physical interaction network: a three-layer multimodal network of drug target proteins, human protein-protein interactions, and viral-host protein-protein interactions. The second method evaluated sequence similarity between viral proteins and other proteins, visualized by constructing a virus-host-similarity interaction network. Methods were validated on the human immunodeficiency virus, hepatitis B, hepatitis C, and human papillomavirus, then deployed on SARS-CoV-2. Comparison of virus-host-physical interaction predictions to known antiviral drugs had AUCs of 0.69, 0.59, 0.78, and 0.67, respectively, reflecting that the scores are predictive of effective drugs. For SARS-CoV-2, 569 candidate drugs were predicted, of which 37 had been included in clinical trials for SARS-CoV-2 (AUC = 0.75, P-value 3.21 × 10⁻³). As further validation, top-ranked candidate antiviral drugs were analyzed for binding to protein targets in silico; binding scores generated by BindScope indicated a 70% success rate.
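The AUC values quoted above measure how well a score ranks known antiviral drugs ahead of other drugs. As a minimal sketch of that quantity (not the authors' code), the function below computes AUC as the probability that a randomly chosen positive item scores higher than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy scores and labels are assumptions made for illustration only.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: network scores for six drugs; label 1 marks a known antiviral.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 0, 1, 0, 0, 1]
toy_auc = roc_auc(toy_scores, toy_labels)
print(f"toy AUC = {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.59 to 0.78 should be read.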

  • Article type: Journal Article
    Whole-genome transcription has been measured in peripheral blood samples as a candidate biomarker of inflammation associated with major depressive disorder.
    We searched for all case-control studies on major depressive disorder that reported microarray or RNA sequencing measurements on whole blood or peripheral blood mononuclear cells. Primary datasets were reanalyzed, when openly accessible, to estimate case-control differences and to evaluate the functional roles of differentially expressed gene lists by technically harmonized methods.
    We found 10 eligible studies (N = 1754 depressed cases and N = 1145 healthy controls). Fifty-two genes were called significant by 2 of the primary studies (published overlap list). After harmonization of analysis across 8 accessible datasets (n = 1706 cases, n = 1098 controls), 272 genes were coincidentally listed in the top 3% most differentially expressed genes in 2 or more studies of whole blood or peripheral blood mononuclear cells with concordant direction of effect (harmonized overlap list). By meta-analysis of standardized mean difference across 4 studies of whole-blood samples (n = 1567 cases, n = 954 controls), 343 genes were found with false discovery rate <5% (standardized mean difference meta-analysis list). These 3 lists intersected significantly. Genes abnormally expressed in major depressive disorder were enriched for innate immune-related functions, coded for nonrandom protein-protein interaction networks, and coexpressed in the normative transcriptome module specialized for innate immune and neutrophil functions.
    Quantitative review of existing case-control data provided robust evidence for abnormal expression of gene networks important for the regulation and implementation of innate immune response. Further development of white blood cell transcriptional biomarkers for inflamed depression seems warranted.
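The meta-analysis step described above (pooling a standardized mean difference per gene across studies, then keeping genes with false discovery rate below 5%) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say whether a fixed-effect or random-effects model was used, so inverse-variance fixed-effect pooling is assumed here, and Benjamini-Hochberg is assumed as the FDR procedure. All function names and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pool_fixed_effect(smds: Sequence[float], variances: Sequence[float]) -> Tuple[float, float, float]:
    """Inverse-variance pooled SMD, its standard error, and a two-sided normal p-value."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, smds)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = pooled / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return pooled, se, p_value


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene inputs: (SMDs from 4 studies, their sampling variances).
toy_genes = {
    "GENE_A": ([0.30, 0.25, 0.35, 0.28], [0.010, 0.012, 0.015, 0.011]),
    "GENE_B": ([0.02, -0.05, 0.04, 0.00], [0.010, 0.012, 0.015, 0.011]),
}
names = list(toy_genes)
pvals = [pool_fixed_effect(*toy_genes[g])[2] for g in names]
significant = [g for g, q in zip(names, benjamini_hochberg(pvals)) if q < 0.05]
print(significant)
```

With real data, the same two functions would be applied across all measured genes, and the genes whose adjusted p-value falls below 0.05 would form the meta-analysis list.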

  • Article type: Journal Article
    The kernel canonical correlation analysis based U-statistic (KCCU) is used to detect nonlinear gene-gene co-associations. Estimating the variance of the KCCU is, however, computationally intensive. In addition, kernel canonical correlation analysis (kernel CCA) is not robust to contaminated data. Using a robust kernel mean element and a robust kernel (cross)-covariance operator potentially enables the use of a robust kernel CCA, which is studied in this paper. We first propose an influence function-based estimator for the variance of the KCCU. We then present a non-parametric robust KCCU, which is designed for dealing with contaminated data. The robust KCCU is less sensitive to noise than the KCCU. We investigate the proposed method using both synthesized and real data from the Mind Clinical Imaging Consortium (MCIC). We show through simulation studies that the power of the proposed methods is a monotonically increasing function of sample size, and that the robust test statistics bring incremental gains in power. To demonstrate the advantage of the robust kernel CCA, we study MCIC data among 22,442 candidate schizophrenia genes for gene-gene co-associations. We select 768 genes with strong evidence for shedding light on gene-gene interaction networks for schizophrenia. By performing gene ontology enrichment analysis, pathway analysis, gene-gene network and other studies, the proposed robust methods can find undiscovered genes in addition to significant gene pairs, and demonstrate superior performance over several current approaches.

  • Article type: Journal Article
    In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSMs). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need a way to keep the size of the data they process under control. We used parallel and distributed processing by splitting data into multiple partitions and applied SSMs to each partition; this approach helped manage big data scalability and computational problems. Our solution involves three steps: split gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] are enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time of calculating semantic similarity between gene pairs and improved performance of the three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with using split GO and data clustering algorithms to split input data based on their similarity, reduced the average time more than did the approach of equally dividing input data. Time reduction increased with increasing number of splits. Time reduction percentage was 24.1%, 39.2%, and 66.6% for Threaded SSDD; 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively; and 92.04% for Threaded Resnik in the case of four slaves.
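To make the third step above concrete, here is a minimal sketch of the Resnik measure evaluated over many term pairs with a thread pool. Resnik similarity is the information content (IC) of the most informative common ancestor of two terms, where IC(t) = -log p(t) and p(t) is the share of annotations attached to t or its descendants. The tiny ontology, the annotation counts, and every function name are hypothetical, and the threading shown is only an assumption about how such a measure could be parallelized; the abstract does not describe the authors' implementation.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical toy ontology (child -> parents) standing in for a GO sub-graph.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "metabolism": ["root"],
    "signaling": ["root"],
    "lipid_metabolism": ["metabolism"],
    "sugar_metabolism": ["metabolism"],
}
# Hypothetical number of gene annotations attached directly to each term.
DIRECT_COUNTS: Dict[str, int] = {
    "root": 0, "metabolism": 2, "signaling": 4,
    "lipid_metabolism": 1, "sugar_metabolism": 1,
}


def ancestors(term: str) -> Set[str]:
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content() -> Dict[str, float]:
    """IC(t) = log(total / count(t)), where count(t) includes descendant annotations."""
    total = sum(DIRECT_COUNTS.values())
    cumulative = {t: 0 for t in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    return {t: math.log(total / c) for t, c in cumulative.items() if c > 0}


IC = information_content()


def resnik(a: str, b: str) -> float:
    """IC of the most informative common ancestor of a and b."""
    common = ancestors(a) & ancestors(b)
    return max(IC.get(t, 0.0) for t in common)


def resnik_many(pairs: Sequence[Tuple[str, str]], workers: int = 4) -> List[float]:
    """Evaluate many term pairs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: resnik(*pair), pairs))


pairs = [("lipid_metabolism", "sugar_metabolism"), ("lipid_metabolism", "signaling")]
print(resnik_many(pairs))
```

In CPython, threads do not speed up pure-Python arithmetic because of the global interpreter lock, so this shows the structure of the approach rather than its timing gains; process pools or a compiled implementation would be the usual route to a real speedup.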

  • Article type: Journal Article
    Disease comorbidity is very common and has significant impact on disease treatment. Revealing the associations among diseases may help to understand the mechanisms of diseases, improve the prevention and treatment of diseases, and support the discovery of new drugs or new uses of existing drugs.
    In this paper, we introduced a mathematical model to represent gene-related diseases with a series of associated genes based on the overrepresentation of genes and diseases in PubMed literature. We also illustrated an efficient way to reveal the implicit connections between COPD and other diseases based on this model.
    We applied this approach to analyze the relationships between Chronic Obstructive Pulmonary Disease (COPD) and other diseases under the Lung Diseases branch of the Medical Subject Headings (MeSH) index system and detected 4 novel diseases relevant to COPD. As judged by domain experts, the F score of our approach is up to 77.6%.
    The results demonstrate the effectiveness of the gene fingerprint model for diseases on the basis of medical literature.

  • Article type: Journal Article
    BACKGROUND: RNA-sequencing analysis is increasingly utilized to study gene expression in non-model organisms without sequenced genomes. Aethionema arabicum (Brassicaceae) exhibits seed dimorphism as a bet-hedging strategy - producing both a less dormant mucilaginous (M+) seed morph and a more dormant non-mucilaginous (NM) seed morph. Here, we compared de novo and reference-genome based transcriptome assemblies to investigate Ae. arabicum seed dimorphism and to evaluate the reference-free versus -dependent approach for identifying differentially expressed genes (DEGs).
    RESULTS: A de novo transcriptome assembly was generated using sequences from M+ and NM Ae. arabicum dry seed morphs. The transcripts of the de novo assembly contained 63.1% complete Benchmarking Universal Single-Copy Orthologs (BUSCO) compared to 90.9% for the transcripts of the reference genome. DEG detection used the strict consensus of three methods (DESeq2, edgeR and NOISeq). Only 37% of 1533 differentially expressed de novo assembled transcripts paired with 1876 genome-derived DEGs. Gene Ontology (GO) terms distinguished the seed morphs: the terms translation and nucleosome assembly were overrepresented in DEGs higher in abundance in M+ dry seeds, whereas terms related to mRNA processing and transcription were overrepresented in DEGs higher in abundance in NM dry seeds. DEGs amongst these GO terms included ribosomal proteins and histones (higher in M+), RNA polymerase II subunits and related transcription and elongation factors (higher in NM). Expression of the inferred DEGs and other genes associated with seed maturation (e.g. those encoding late embryogenesis abundant proteins and transcription factors regulating seed development and maturation such as ABI3, FUS3, LEC1 and WRI1 homologs) were put in context with Arabidopsis thaliana seed maturation and indicated that M+ seeds may desiccate and mature faster than NM. The 1901 transcriptomic DEG set GO-terms had almost 90% overlap with the 2191 genome-derived DEG GO-terms.
    CONCLUSIONS: Whilst there was only modest overlap of DEGs identified in reference-free versus -dependent approaches, the resulting GO analysis was concordant in both approaches. The identified differences in dry seed transcriptomes suggest mechanisms underpinning previously identified contrasts between morphology and germination behaviour of M+ and NM seeds.
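The "strict consensus of three methods" and the overlap percentages reported above come down to set operations on lists of identifiers. The sketch below illustrates this, assuming that strict consensus means a gene must be called by every method; the gene identifiers and the function names `strict_consensus` and `overlap_fraction` are hypothetical and are not taken from the study.

```python
from typing import Dict, Set


def strict_consensus(calls: Dict[str, Set[str]]) -> Set[str]:
    """Genes called differentially expressed by every method in `calls`."""
    gene_sets = list(calls.values())
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


def overlap_fraction(reference: Set[str], other: Set[str]) -> float:
    """Share of `reference` items that also appear in `other`."""
    if not reference:
        return 0.0
    return len(reference & other) / len(reference)


# Hypothetical DEG calls from the three tools named in the abstract.
toy_calls = {
    "DESeq2": {"g1", "g2", "g3", "g4"},
    "edgeR": {"g2", "g3", "g4", "g5"},
    "NOISeq": {"g2", "g3", "g6"},
}
consensus = strict_consensus(toy_calls)

# Hypothetical GO term sets from the de novo and genome-based analyses.
de_novo_terms = {"translation", "nucleosome assembly", "mRNA processing"}
genome_terms = {"translation", "nucleosome assembly", "transcription"}
print(sorted(consensus), overlap_fraction(de_novo_terms, genome_terms))
```

The overlap fraction is asymmetric by design: it answers what share of the first set is recovered in the second, which is how a statement such as "37% of 1533 transcripts paired with genome-derived DEGs" is framed.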

  • Article type: Journal Article
    Deciphering the genetic basis of pancreatic cancer (PC) and its associated multimorbidities will enhance our knowledge toward PC control. The study investigated the common genetic background of PC and different morbidities through a computational approach and further evaluated the less explored association between PC and autoimmune diseases (AIDs) through an epidemiological analysis. Gene-disease associations (GDAs) of 26 morbidities of interest and PC were obtained using the DisGeNET public discovery platform. The association between AIDs and PC indicated by the computational analysis was confirmed through multivariable logistic regression models in the PanGen European case-control study population of 1,705 PC cases and 1,084 controls. Fifteen morbidities shared at least one gene with PC in the DisGeNET database. Based on common genes, several AIDs were genetically associated with PC, pointing to a potential link between them. An epidemiologic analysis confirmed that having any of the nine AIDs studied was significantly associated with a reduced risk of PC (odds ratio (OR) = 0.74, 95% confidence interval (CI) 0.58-0.93), which decreased further in subjects having ≥2 AIDs (OR = 0.39, 95% CI 0.21-0.73). In independent analyses, polymyalgia rheumatica and rheumatoid arthritis were significantly associated with low PC risk (OR = 0.40, 95% CI 0.19-0.89, and OR = 0.73, 95% CI 0.53-1.00, respectively). Several inflammatory-related morbidities shared a common genetic component with PC based on public databases. These molecular links could shed light on the molecular mechanisms underlying PC development and simultaneously generate novel hypotheses. In our study, we report sound findings pointing to an association between AIDs and a reduced risk of PC.