Keywords: IDP; IDR; deep learning; disorder; machine learning; protein language model; protein structure prediction

MeSH: Intrinsically Disordered Proteins / chemistry / metabolism; Databases, Protein; Models, Molecular; Computational Biology / methods; Protein Conformation; Molecular Sequence Annotation; Algorithms

Source: DOI: 10.1016/j.str.2024.04.010

Abstract:
Despite their lack of a rigid structure, intrinsically disordered regions (IDRs) in proteins play important roles in cellular functions, including mediating protein-protein interactions. Therefore, it is important to computationally annotate IDRs with high accuracy. In this study, we present Disordered Region prediction using Bidirectional Encoder Representations from Transformers (DR-BERT), a compact protein language model. Unlike most popular tools, DR-BERT is pretrained on unannotated proteins and trained to predict IDRs without relying on explicit evolutionary or biophysical data. Despite this, DR-BERT demonstrates significant improvement over existing methods on the Critical Assessment of protein Intrinsic Disorder (CAID) evaluation dataset and outperforms competitors on two out of four test cases in the CAID 2 dataset, while maintaining competitiveness in the others. This performance is due to the information learned during pretraining and DR-BERT's ability to use contextual information.
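The abstract describes a bidirectional transformer encoder that assigns a disorder probability to each residue using sequence context alone. As a minimal illustrative sketch only (this is not the DR-BERT architecture or its weights; the dimensions, parameters, and `predict_disorder` function below are hypothetical and randomly initialized, not trained), a context-aware per-residue classifier can be written as:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard amino acids
aa_to_idx = {a: i for i, a in enumerate(AA)}

rng = np.random.default_rng(0)
d = 16                               # toy embedding dimension (assumption)

# Randomly initialized toy parameters; a real model would learn these
# during pretraining on unannotated sequences and fine-tuning on IDR labels.
E = rng.normal(size=(len(AA), d))    # per-amino-acid embeddings
Wq = rng.normal(size=(d, d))         # query projection
Wk = rng.normal(size=(d, d))         # key projection
Wv = rng.normal(size=(d, d))         # value projection
w_out = rng.normal(size=d)           # per-residue disorder logit head

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def predict_disorder(seq):
    """Return one disorder probability per residue of `seq`.

    A single bidirectional self-attention layer mixes information from
    the whole sequence into each position before the sigmoid head,
    mirroring how an encoder model uses context in both directions.
    """
    X = E[[aa_to_idx[a] for a in seq]]           # (L, d) residue embeddings
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (L, L) attention weights
    H = A @ V                                    # context-mixed representations
    return 1.0 / (1.0 + np.exp(-(H @ w_out)))    # sigmoid -> probabilities

probs = predict_disorder("MKTAYIAKQR")           # one score per residue
```

The point of the sketch is the shape of the computation: every output position attends to the full sequence, so the disorder score of a residue depends on its neighbors on both sides, which is the contextual advantage the abstract attributes to the transformer approach.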