protein annotations

  • Article type: Journal Article
    UniProtKB currently holds more than 251 million deposited proteins. However, only 0.25% have been annotated with one of the more than 15,000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has succeeded in growing the Pfam annotations automatically, but at a low rate compared with protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge for poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. It uses protein large language models (LLMs), trained with self-supervision on large unannotated datasets, to obtain sequence embeddings. The embeddings can then be used with supervised learning on a small annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.
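
The two-stage idea in this abstract (fixed sequence embeddings first, then a small supervised model on top) can be illustrated with a minimal sketch. It is not the authors' pipeline: the `embed` function is a hypothetical stand-in that returns amino-acid composition instead of a real protein LLM embedding, the nearest-centroid classifier is an assumed choice of supervised model, and the sequences and family labels (`FAM_A`, `FAM_B`) are synthetic toy data.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def embed(sequence: str) -> np.ndarray:
    """Hypothetical stand-in for a protein LLM embedding (assumption).

    A real pipeline would return a pooled embedding from a pretrained
    model; here we return the 20-dim amino-acid composition so the
    sketch runs without any model download.
    """
    counts = np.array([sequence.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(len(sequence), 1)

def fit_centroids(embeddings: np.ndarray, labels: list):
    """Supervised step: one mean embedding (centroid) per family."""
    classes = sorted(set(labels))
    label_array = np.array(labels)
    centroids = np.stack(
        [embeddings[label_array == c].mean(axis=0) for c in classes]
    )
    return classes, centroids

def predict(embeddings: np.ndarray, classes: list, centroids: np.ndarray) -> list:
    """Assign each embedding to the family with the nearest centroid."""
    distances = np.linalg.norm(
        embeddings[:, None, :] - centroids[None, :, :], axis=2
    )
    return [classes[i] for i in distances.argmin(axis=1)]

# Synthetic toy sequences with hypothetical family labels.
train_seqs = ["AAAAGGAA", "AAGAAAGA", "KKKKEEKK", "KEKKKEKK"]
train_labels = ["FAM_A", "FAM_A", "FAM_B", "FAM_B"]

X_train = np.stack([embed(s) for s in train_seqs])
classes, centroids = fit_centroids(X_train, train_labels)

X_test = np.stack([embed(s) for s in ["AGAAAAAG", "KKEKKKKE"]])
predictions = predict(X_test, classes, centroids)
print(predictions)
```

Because the embedding step is frozen, only the small classifier is fitted on the annotated data, which is why this kind of setup can cope with families that have few labeled members.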

  • Article type: Journal Article
    The Citrus genus comprises some of the most important and commonly cultivated fruit plants. Within the last decade, citrus greening disease (also known as huanglongbing or HLB) has emerged as the biggest threat to the citrus industry. The disease has no cure yet, and many efforts have been made to find a solution to this devastating condition. Generating high-yield resistant cultivars remains challenging, in part because of the limited and sparse knowledge about the mechanisms that Liberibacter bacteria use to spread the infection in Citrus plants. Here, we present GreeningDB, a database implemented to provide the annotation of Liberibacter proteomes, as well as the host-pathogen comparactomics tool, a novel platform to compare the predicted interactomes of two HLB host-pathogen systems. GreeningDB is built to deliver a user-friendly interface, including network visualization and links to other resources. We hope that by providing these features, GreeningDB can become a central resource for retrieving HLB-related protein annotations and thus aid the community that is pursuing molecular-based strategies to mitigate this disease's impact. The database is freely available at http://bioinfo.usu.edu/GreeningDB/ (accessed on 11 August 2021).

  • Article type: Journal Article
    BACKGROUND: Genomic data are prevalent, leading to frequent encounters with uninterpreted variants or mutations with unknown mechanisms of effect. Researchers must manually aggregate data from multiple sources and across related proteins, mentally translating effects between the genome and proteome, to attempt to understand mechanisms.
    METHODS: P2T2 presents diverse data and annotation types in a unified protein-centric view, facilitating the interpretation of coding variants and hypothesis generation. Information from the primary sequence, domain, motif, and structural levels is presented and also organized into the first Paralog Annotation Analysis across the human proteome.
    RESULTS: Our tool assists research efforts to interpret genomic variation by aggregating diverse, relevant, and proteome-wide information into a unified interactive web-based interface. Additionally, we provide a REST API enabling automated data queries, or repurposing data for other studies.
    CONCLUSIONS: The unified protein-centric interface presented in P2T2 will help researchers interpret novel variants identified through next-generation sequencing. Code and server link available at github.com/GenomicInterpretation/p2t2.

  • Article type: Comparative Study
    We introduce Proteome-Wide Association Study (PWAS), a new method for detecting gene-phenotype associations mediated by protein function alterations. PWAS aggregates the signal of all variants jointly affecting a protein-coding gene and assesses their overall impact on the protein's function using machine learning and probabilistic models. Subsequently, it tests whether the gene exhibits functional variability between individuals that correlates with the phenotype of interest. PWAS can capture complex modes of heritability, including recessive inheritance. A comparison with GWAS and other existing methods demonstrates its capacity to recover causal protein-coding genes and highlight new associations. PWAS is available as a command-line tool.
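
The aggregation-then-association idea can be sketched as follows. This is a simplified illustration rather than the PWAS implementation: it assumes each alternate allele copy is an independent "hit" with a known damage probability, defines a dominant score as the probability of at least one hit and a recessive score as the probability of at least two, and uses a plain Pearson correlation in place of a proper association test. The function name `gene_scores`, the damage probabilities, the genotypes, and the phenotype values are all hypothetical toy inputs.

```python
import numpy as np

def gene_scores(genotypes: np.ndarray, damage_prob: np.ndarray):
    """Aggregate per-variant damage probabilities into gene-level scores.

    genotypes:   alternate-allele counts (0, 1 or 2) per variant
    damage_prob: assumed probability that one copy of each variant
                 damages the protein (hypothetical inputs)
    Returns (dominant, recessive): P(>= 1 hit) and P(>= 2 hits),
    treating every allele copy as an independent hit (assumption).
    """
    # One entry per allele copy carried by this individual.
    p = np.repeat(damage_prob, genotypes)
    p_no_hit = np.prod(1.0 - p)
    p_one_hit = sum(
        p[i] * np.prod(np.delete(1.0 - p, i)) for i in range(len(p))
    )
    dominant = 1.0 - p_no_hit
    recessive = 1.0 - p_no_hit - p_one_hit
    return dominant, recessive

# Toy cohort: 4 individuals x 2 variants in one gene (synthetic data).
damage_prob = np.array([0.5, 0.2])
cohort = np.array([[0, 0], [1, 0], [1, 1], [2, 0]])
phenotype = np.array([0.0, 0.0, 1.0, 1.0])

scores = np.array([gene_scores(g, damage_prob) for g in cohort])
dominant_scores, recessive_scores = scores[:, 0], scores[:, 1]

# Stand-in for the association step: correlate each gene-level score
# with the phenotype across individuals.
r_dominant = np.corrcoef(dominant_scores, phenotype)[0, 1]
r_recessive = np.corrcoef(recessive_scores, phenotype)[0, 1]
print(r_dominant, r_recessive)
```

Keeping separate dominant and recessive scores is what lets a gene-level test pick up recessive effects, where a single damaged copy has little influence on the phenotype.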

  • Article type: Journal Article
    Post-translational modifications (PTMs) of protein amino acids are ubiquitous and important to protein function, localization, degradation, and more. In recent years, there has been an explosion in the discovery of PTMs as a result of improvements in PTM measurement techniques, including quantitative measurements of PTMs across multiple conditions. ProteomeScout is a repository for such discovery and quantitative experiments and provides tools for visualizing PTMs within proteins, including where they are relative to other PTMs, domains, mutations, and structure. ProteomeScout additionally provides analysis tools for identifying statistically significant relationships in experimental datasets. This unit describes four basic protocols for working with the ProteomeScout Web interface or programmatically with the database download. © 2017 by John Wiley & Sons, Inc.

  • Article type: Journal Article
    With the advent of high-throughput techniques like Next Generation Sequencing, the amount of biological information for genes and proteins is growing faster than ever. Structural information is also growing rapidly, especially in the cryo-electron microscopy area. However, in many cases, the proteomic and genomic data are spread across multiple databases, with no simple connection to structural information. In this work we present a new web platform that integrates EMDB/PDB structures and UniProt sequences with different sources of protein annotations. The application provides an interactive interface linking sequence and structure, including EM maps, presenting the different sources of information at the sequence and structural levels. The web application is available at http://3dbionotes.cnb.csic.es.