Keywords: computational biology; genome-scale annotation; geometric deep learning; language model; multi-task; protein binding site; systems biology

MeSH: Protein Binding; Deep Learning; Proteins / metabolism; Binding Sites; Peptides / metabolism

Source: DOI:10.7554/eLife.93695

Abstract:
Revealing protein binding sites with other molecules, such as nucleic acids, peptides, or small ligands, sheds light on disease mechanism elucidation and novel drug design. With the explosive growth of proteins in sequence databases, how to accurately and efficiently identify these binding sites from sequences becomes essential. However, current methods mostly rely on expensive multiple sequence alignments or experimental protein structures, limiting their genome-scale applications. Besides, these methods have not fully explored the geometry of the protein structures. Here, we propose GPSite, a multi-task network for simultaneously predicting binding residues of DNA, RNA, peptide, protein, ATP, HEM, and metal ions on proteins. GPSite was trained on informative sequence embeddings and predicted structures from protein language models, while comprehensively extracting residual and relational geometric contexts in an end-to-end manner. Experiments demonstrate that GPSite substantially surpasses state-of-the-art sequence-based and structure-based approaches on various benchmark datasets, even when the structures are not well-predicted. The low computational cost of GPSite enables rapid genome-scale binding residue annotations for over 568,000 sequences, providing opportunities to unveil unexplored associations of binding sites with molecular functions, biological processes, and genetic variants. The GPSite webserver and annotation database can be freely accessed at https://bio-web1.nscc-gz.cn/app/GPSite.