Gene Ontology
  • Article type: Journal Article
    Characterising gene function for the ever-increasing number and diversity of species with annotated genomes relies almost entirely on computational prediction methods. These software tools are numerous and diverse, each with different strengths and weaknesses as revealed through community benchmarking efforts. Meta-predictors that assess consensus and conflict among individual algorithms should deliver enhanced functional annotations. To exploit the benefits of meta-approaches, we developed CrowdGO, an open-source consensus-based Gene Ontology (GO) term meta-predictor that employs machine learning models with GO term semantic similarities and information contents. By re-evaluating each gene-term annotation, CrowdGO produces a consensus dataset with high-scoring confident annotations and low-scoring rejected annotations. Applying CrowdGO to results from one deep learning-based method, one sequence similarity-based method, and two protein domain-based methods delivers consensus annotations with improved precision and recall. Furthermore, using standard evaluation measures, CrowdGO performance matches that of the community's best-performing individual methods. CrowdGO therefore offers a model-informed approach to leverage the strengths of individual predictors and produce comprehensive and accurate gene functional annotations.
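
    The "information content" of a GO term mentioned above is commonly defined as the negative log of the fraction of genes annotated to that term or any of its descendants, so rarer, more specific terms score higher. The sketch below illustrates that general definition on a toy ontology; the term names, gene names, and the `information_content` helper are hypothetical, and this is not CrowdGO's own implementation.

```python
import math

# Toy GO-like DAG (hypothetical term names): child -> set of parents.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
}

# Toy direct annotations (hypothetical genes).
ANNOTATIONS = {
    "gene1": {"dna_binding"},
    "gene2": {"binding"},
    "gene3": {"catalysis"},
    "gene4": {"dna_binding", "catalysis"},
}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations):
    """IC(t) = -log2(fraction of genes annotated to t or a descendant of t)."""
    counts = {term: 0 for term in PARENTS}
    for terms in annotations.values():
        # Propagate each direct annotation up to every ancestor term.
        propagated = set()
        for term in terms:
            propagated |= ancestors(term)
        for term in propagated:
            counts[term] += 1
    total = len(annotations)
    return {t: -math.log2(c / total) for t, c in counts.items() if c > 0}


ic = information_content(ANNOTATIONS)
```

    In this toy data the root term carries no information because every gene is annotated to it, while `dna_binding` is more informative than its parent `binding`.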

  • Article type: Journal Article
    In molecular biology and genetics, there is a large gap between the ease of data collection and our ability to extract knowledge from these data. Contributing to this gap is the fact that living organisms are complex systems whose emerging phenotypes are the results of multiple complex interactions taking place on various pathways. This demands powerful yet user-friendly pathway analysis tools to translate the now abundant high-throughput data into a better understanding of the underlying biological phenomena. Here we introduce Consensus Pathway Analysis (CPA), a web-based platform that allows researchers to (i) perform pathway analysis using eight established methods (GSEA, GSA, FGSEA, PADOG, Impact Analysis, ORA/Webgestalt, KS-test, Wilcox-test), (ii) perform meta-analysis of multiple datasets, (iii) combine methods and datasets to accurately identify the impacted pathways underlying the studied condition, and (iv) interactively explore impacted pathways and browse relationships between pathways and genes. The platform supports three types of input: (i) a list of differentially expressed genes, (ii) genes and fold changes, and (iii) an expression matrix. It also allows users to import data from NCBI GEO. The CPA platform currently supports the analysis of multiple organisms using KEGG and Gene Ontology, and it is freely available at http://cpa.tinnguyen-lab.com.
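
    Of the eight methods listed, over-representation analysis (ORA) is the simplest to illustrate: it is usually formulated as a one-sided hypergeometric test on the overlap between a gene list and a pathway. The sketch below shows that generic test, not CPA's or WebGestalt's code; the function name and parameters are hypothetical.

```python
from math import comb


def ora_p_value(universe, pathway_size, selected, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap).

    universe      -- number of genes in the background set
    pathway_size  -- number of background genes that belong to the pathway
    selected      -- number of differentially expressed genes
    overlap       -- number of selected genes that fall in the pathway
    """
    upper = min(pathway_size, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway_size, i) * comb(universe - pathway_size, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 in the pathway, 3 selected, 2 overlapping.
p = ora_p_value(universe=10, pathway_size=4, selected=3, overlap=2)
```

    A real analysis would also correct the resulting p-values for testing many pathways at once.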

  • Article type: Journal Article
    BACKGROUND: The hallmarks of cancer provide a highly cited and well-used conceptual framework for describing the processes involved in cancer cell development and tumourigenesis. However, methods for translating these high-level concepts into data-level associations between hallmarks and genes (for high-throughput analysis) vary widely between studies. The examination of different strategies to associate and map cancer hallmarks reveals significant differences, but also consensus.
    RESULTS: Here we present the results of a comparative analysis of cancer hallmark mapping strategies, based on Gene Ontology and biological pathway annotation, from different studies. By analysing the semantic similarity between annotations, and the resulting gene set overlap, we identify emerging consensus knowledge. In addition, we analyse the differences between hallmark and gene set associations using Weighted Gene Co-expression Network Analysis and enrichment analysis.
    CONCLUSIONS: Reaching a community-wide consensus on how to identify cancer hallmark activity from research data would enable more systematic data integration and comparison between studies. These results highlight the current state of the consensus and offer a starting point for further convergence. In addition, we show how a lack of consensus can lead to large differences in the biological interpretation of downstream analyses and discuss the challenges of annotating changing and accumulating biological data, using intermediate knowledge resources that are also changing over time.
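
    One simple way to quantify the "gene set overlap" between two mapping strategies is the Jaccard index. The abstract does not state which overlap measure the authors used, so the sketch below is only an assumed illustration; the gene identifiers and variable names are placeholders.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical gene sets that two studies map to the same hallmark.
study_a = {"g1", "g2", "g3", "g4"}
study_b = {"g3", "g4", "g5"}

overlap = jaccard(study_a, study_b)  # 2 shared genes out of 5 distinct genes
```

    A value near 1 means the two mappings largely agree on which genes represent the hallmark; a value near 0 means they barely overlap.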

  • Article type: Journal Article
    Many complex diseases occur due to genetic factors. A perturbation in the pathway of gene interactions leads to such disorders. Even though a group of genes is responsible, a few significant genes act as biomarkers for the disease, perturbing the healthy network. Identifying such marker genes, or a set of genes that play a pivotal role in a disease, helps with drug prioritization. We propose a scheme for finding potential biomarkers using a multi-layer consensus-driven approach. We reconstruct a functional-module-guided disease sub-network, followed by a multi-step consensus of network inference methods and shared ontological terms. We perform centrality analysis on the sub-networks under consideration and report hub genes as potential key players in the target disease. To establish our scheme's effectiveness, we use Alzheimer's Disease (AD) and Breast Cancer as candidate diseases for experimentation. We evaluate the significance of prioritized genes based on reported evidence. We observe that BRCA1, BRCA2, and PTEN are the essential genes for Breast Cancer, whereas MAPK1, APP, and CASP7 are the essential genes playing an important role during AD.
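
    The centrality step can be pictured with degree centrality, the simplest of the common centrality measures. The abstract does not say which measures were used, so this is an assumed illustration; the node names and the `degree_centrality` helper are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Fraction of the other nodes that each node is directly connected to."""
    neighbours = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbours[u].add(v)
            neighbours[v].add(u)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical undirected sub-network.
edges = [("n1", "n2"), ("n1", "n3"), ("n1", "n4"), ("n2", "n3")]

centrality = degree_centrality(edges)
hub = max(centrality, key=centrality.get)  # node with the highest centrality
```

    Here `n1` touches every other node and would be reported as the hub.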

  • Article type: Journal Article
    Co-fractionation MS (CF-MS) is a technique with the potential to characterize endogenous and unmanipulated protein complexes on an unprecedented scale. However, this potential has been offset by a lack of guidelines for best-practice CF-MS data collection and analysis. To obtain such guidelines, this study thoroughly evaluates novel and published Saccharomyces cerevisiae CF-MS data sets using very high proteome coverage libraries of yeast gold standard complexes. A new method for identifying gold standard complexes in CF-MS data, Reference Complex Profiling, and the Extending 'Guilt-by-Association' by Degree (EGAD) R package are used for these evaluations, which are verified with concurrent analyses of published human data. By evaluating data collection designs, which involve fractionation of cell lysates, it is found that near-maximum recall of complexes can be achieved with fewer samples than in published studies. Distributing sample collection across orthogonal fractionation methods, rather than collecting a single high-resolution data set, leads to particularly efficient recall. By evaluating 17 different similarity scoring metrics, which are central to CF-MS data analysis, it is found that two metrics rarely used in past CF-MS studies (Spearman and Kendall correlations) and the recently introduced Co-apex metric frequently maximize recall, whereas a popular metric, Euclidean distance, delivers poor recall. The common practice of integrating external genomic data into CF-MS data analysis is also evaluated, revealing that this practice may improve the precision and recall of known complexes but is generally unsuitable for predicting novel complexes in model organisms. If studying nonmodel organisms using orthologous genomic data, it is found that particular subsets of fractionation profiles (e.g. the lowest abundance quartile) should be excluded to minimize false discovery. These assessments are summarized in a series of universally applicable guidelines for precise, sensitive and efficient CF-MS studies of known complexes, and effective predictions of novel complexes for orthogonal experimental validation.
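
    To make the contrast between rank correlation and Euclidean distance concrete, the sketch below scores toy elution profiles with both. The profiles are invented, no normalisation is applied (real pipelines often rescale profiles first), and the helper names are hypothetical; it is meant only to show how the two metrics can disagree, not to reproduce the study's evaluation.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy profiles across 7 fractions (hypothetical intensities).
protein_a = [0, 1, 5, 10, 4, 1, 0]
protein_b = [10 * v for v in protein_a]   # co-elutes with A, 10x more abundant
protein_c = [10, 4, 1, 0, 0, 1, 5]        # elutes in different fractions

rho_ab = spearman(protein_a, protein_b)
rho_ac = spearman(protein_a, protein_c)
dist_ab = euclidean(protein_a, protein_b)
dist_ac = euclidean(protein_a, protein_c)
```

    On these unnormalised toy profiles, Spearman correlation pairs A with its co-eluting partner B, whereas raw Euclidean distance places A closer to the non-co-eluting C simply because B is more abundant.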

  • Article type: Journal Article
    Osteosarcoma is the most common subtype of primary bone cancer, affecting mostly adolescents. In recent years, several studies have focused on elucidating the molecular mechanisms of this sarcoma; however, its molecular etiology has still not been determined with precision. Therefore, we applied a consensus strategy with the use of several bioinformatics tools to prioritize genes involved in its pathogenesis. Subsequently, we assessed the physical interactions of the previously selected genes and applied a communality analysis to this protein-protein interaction network. The consensus strategy prioritized a total of 553 genes. Our enrichment analysis validates several studies that describe the signaling pathways PI3K/AKT and MAPK/ERK as pathogenic. The gene ontology analysis described TP53 as a principal signal transducer that chiefly mediates processes associated with cell cycle and DNA damage response. It is interesting to note that the communality analysis clusters several members involved in metastasis events, such as MMP2 and MMP9, and genes associated with DNA repair complexes, such as ATM, ATR, CHEK1, and RAD51. In this study, we have identified well-known pathogenic genes for osteosarcoma and prioritized genes that need to be further explored.

  • Article type: Journal Article
    BACKGROUND: Network analyses, such as those of gene co-expression networks, metabolic networks and ecological networks, have become a central approach for the systems-level study of biological data. Several software packages exist for generating and analyzing such networks, either from correlation scores or from the absolute value of a transformed score called weighted topological overlap (wTO). However, since gene regulatory processes can up- or down-regulate genes, it is of great interest to explicitly consider both positive and negative correlations when constructing a gene co-expression network.
    RESULTS: Here, we present an R package for calculating the weighted topological overlap (wTO) that, in contrast to existing packages, explicitly addresses the sign of the wTO values and is thus especially valuable for the analysis of gene regulatory networks. The package includes the calculation of p-values (raw and adjusted) for each pairwise gene score. Our package also allows the calculation of networks from time series (without replicates). Since networks from independent datasets (biological repeats or related studies) are not the same due to technical and biological noise in the data, we additionally incorporated into our R package a novel method for calculating a consensus network (CN) from two or more networks. To graphically inspect the resulting networks, the R package contains a visualization tool, which allows direct network manipulation and access to node and link information. When testing the package on a standard laptop computer, we can conduct all calculations for systems of more than 20,000 genes in under two hours. We compare our new wTO package to state-of-the-art packages and demonstrate the application of the wTO and CN functions using three independently derived datasets from healthy human pre-frontal cortex samples. To showcase an example of the time series application, we used a metagenomics data set.
    CONCLUSIONS: In this work, we developed a software package that provides the computation of wTO networks and CNs, together with a visualization tool, in the R statistical environment. It is publicly available on CRAN repositories under the GPL-2 Open Source License (https://cran.r-project.org/web/packages/wTO/).
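
    The abstract does not give the wTO formula. A commonly cited signed form is w_ij = (sum over u of a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - |a_ij|), where a is the signed correlation matrix and k_i is the sum of |a_ij| over all j other than i. The Python sketch below implements that assumed form on a toy matrix; it is not the package's R code, and the function name is hypothetical.

```python
def signed_wto(adj):
    """Signed weighted topological overlap for a symmetric signed matrix.

    Assumed form: w_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - |a_ij|)
    with k_i = sum_j |a_ij| (j != i). Diagonal entries are left at 0.0.
    """
    n = len(adj)
    k = [sum(abs(adj[i][j]) for j in range(n) if j != i) for i in range(n)]
    wto = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Contribution of shared neighbours u; the sign of each product is kept.
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            wto[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - abs(adj[i][j]))
    return wto


# Hypothetical signed correlations among three genes.
correlations = [
    [0.0, 0.8, -0.6],
    [0.8, 0.0, -0.5],
    [-0.6, -0.5, 0.0],
]

w = signed_wto(correlations)
```

    Because the products are not wrapped in absolute values, the positively correlated pair keeps a positive score and the negatively correlated pairs keep negative scores, which is the property the abstract emphasises.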

  • Article type: Journal Article
    The diploid strawberry, Fragaria vesca, is a developing model system for the economically important Rosaceae family. Strawberry fleshy fruit develops from the floral receptacle and its ripening is nonclimacteric. The external seed configuration of strawberry fruit facilitates the study of seed-to-fruit cross-tissue communication, particularly phytohormone biosynthesis and transport. To investigate strawberry fruit development, we previously generated spatial and temporal transcriptome data profiling F. vesca flower and fruit development pre- and postfertilization. In this study, we combined 46 of our existing RNA-seq libraries to generate coexpression networks using the Weighted Gene Co-Expression Network Analysis package in R. We then applied a post-hoc consensus clustering approach and used bootstrapping to demonstrate consensus clustering's ability to produce robust and reproducible clusters. Further, we experimentally tested hypotheses based on the networks, including increased iron transport from the receptacle to the seed postfertilization, and characterized an F. vesca floral mutant and its candidate gene. To increase their utility, the networks are presented in a web interface (www.fv.rosaceaefruits.org) for easy exploration and identification of coexpressed genes. Together, the work reported here illustrates ways to generate robust networks optimized for the mining of large transcriptome data sets, thereby providing a useful resource for hypothesis generation and experimental design in strawberry and related Rosaceae fruit crops.

  • Article type: Journal Article
    BACKGROUND: The systemic information enclosed in microarray data encodes relevant clues to overcome the poorly understood combination of genetic and environmental factors in Parkinson's disease (PD), which represents the major obstacle to understanding its pathogenesis and developing disease-modifying therapeutics. While several gene prioritization approaches have been proposed, none dominates the rest. Instead, hybrid approaches seem to outperform individual approaches.
    METHODS: A consensus strategy is proposed for PD-related gene prioritization from mRNA microarray data based on the combination of three independent prioritization approaches: Limma, machine learning, and weighted gene co-expression networks.
    RESULTS: The consensus strategy outperformed the individual approaches in terms of statistical significance, overall enrichment and early recognition ability. In addition to significant biological relevance, the set of 50 prioritized genes exhibited an excellent early recognition ability (6 of the top 10 genes are directly associated with PD). 40% of the prioritized genes were previously associated with PD, including well-known PD-related genes such as SLC18A2, TH or DRD2. Eight genes (CCNH, DLK1, PCDH8, SLIT1, DLD, PBX1, INSM1, and BMI1) were found to be significantly associated with biological processes affected in PD, representing potentially novel PD biomarkers or therapeutic targets. Additionally, several metrics of standard use in chemoinformatics are proposed to evaluate the early recognition ability of gene prioritization tools.
    CONCLUSIONS: The proposed consensus strategy represents an efficient and biologically relevant approach for gene prioritization tasks, providing a valuable decision-making tool for the study of PD pathogenesis and the development of disease-modifying PD therapeutics.
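
    The abstract does not say how the three rankings are merged. One simple possibility is to order genes by their mean rank across methods, as sketched below; the mean-rank rule, the gene labels, and the `consensus_ranking` helper are all assumptions made for illustration.

```python
def consensus_ranking(rankings):
    """Order genes by mean rank across methods (rank 1 = highest priority).

    Each ranking is a list of the same genes, best first. Ties in mean rank
    are broken alphabetically so the result is deterministic.
    """
    genes = set(rankings[0])
    if any(set(r) != genes for r in rankings):
        raise ValueError("all rankings must contain the same genes")
    mean_rank = {
        g: sum(r.index(g) + 1 for r in rankings) / len(rankings) for g in genes
    }
    return sorted(genes, key=lambda g: (mean_rank[g], g))


# Hypothetical rankings from three prioritization approaches.
by_limma = ["g1", "g2", "g3", "g4"]
by_machine_learning = ["g2", "g1", "g4", "g3"]
by_coexpression = ["g2", "g3", "g1", "g4"]

consensus = consensus_ranking([by_limma, by_machine_learning, by_coexpression])
```

    In this toy example `g2` rises to the top because two of the three methods rank it first, even though one method prefers `g1`.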