BERT

  • Article type: Journal Article
    Toxicity identification plays a key role in maintaining human health, as it can alert humans to the potential hazards caused by long-term exposure to a wide variety of chemical compounds. Experimental methods for determining toxicity are time-consuming and costly, while computational methods offer an alternative for the early identification of toxicity. For example, some classical machine learning (ML) and deep learning (DL) methods have demonstrated excellent performance in toxicity prediction. However, these methods also have defects, such as over-reliance on handcrafted features and a tendency to overfit, so proposing novel models with superior prediction performance remains an urgent task. In this study, we propose a motif-level, graph-based, multi-view pretrained language model, called 3MTox, for toxicity identification. The 3MTox model uses Bidirectional Encoder Representations from Transformers (BERT) as the backbone framework and a motif graph as input. The results of extensive experiments show that our 3MTox model achieves state-of-the-art performance on toxicity benchmark datasets and outperforms the baseline models considered. In addition, the interpretability of the model ensures that it can quickly and accurately identify toxicity sites in a given molecule, thereby contributing to the determination of toxicity status and associated analyses. We consider the 3MTox model to be among the most promising tools currently available for toxicity identification.

  • Article type: Journal Article
    Amid the wave of globalization, cultural amalgamation has become increasingly common, bringing the challenges inherent in cross-cultural communication to the fore. To address these challenges, contemporary research has shifted its focus to human-computer dialogue. Especially in the educational paradigm of human-computer dialogue, analysing emotion recognition in user dialogues is particularly important: accurately identifying and understanding users' emotional tendencies directly affects the efficiency and experience of human-computer interaction. This study aims to improve the capability of language emotion recognition in human-computer dialogue. It proposes a hybrid model (BCBA) based on Bidirectional Encoder Representations from Transformers (BERT), convolutional neural networks (CNN), bidirectional gated recurrent units (BiGRU), and an attention mechanism. The model leverages BERT to extract semantic and syntactic features from the text. Simultaneously, it integrates CNN and BiGRU networks to delve deeper into textual features, enhancing the model's proficiency in nuanced sentiment recognition. Furthermore, by introducing the attention mechanism, the model can assign different weights to words based on their emotional tendencies. This enables it to prioritize words with discernible emotional inclinations for more precise sentiment analysis. Through experimental validation on two datasets, the BCBA model achieved remarkable results in emotion recognition and classification tasks, significantly improving both accuracy and F1 score, with an average accuracy of 0.84 and an average F1 score of 0.8. Confusion matrix analysis reveals a minimal classification error rate, and as the number of iterations increases, the model's recall rate stabilizes at approximately 0.7. This accomplishment demonstrates the model's robust capabilities in semantic understanding and sentiment analysis and showcases its advantages in handling emotional characteristics in language expressions within a cross-cultural context. The BCBA model proposed in this study provides effective technical support for emotion recognition in human-computer dialogue, which is of great significance for building more intelligent and user-friendly human-computer interaction systems. In the future, we will continue to optimize the model's structure, improve its capability to handle complex emotions and cross-lingual emotion recognition, and explore applying the model to more practical scenarios to further promote the development and application of human-computer dialogue technology.
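The attention step described above — weighting words by their emotional tendency before pooling — can be sketched as simple dot-product attention over token vectors. This is a minimal illustration, not the paper's implementation; the query vector, dimensions, and random features here are hypothetical stand-ins for the learned parameters and BERT/BiGRU outputs.

```python
import numpy as np

def attention_pool(token_vecs, query):
    """Score each token against a learned query vector, softmax the
    scores into weights, and return the weighted sum of token vectors."""
    scores = token_vecs @ query                 # one score per token
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()           # softmax -> weights sum to 1
    return weights, weights @ token_vecs        # weighted sentence vector

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))   # 5 token vectors (hypothetical), dim 8
query = rng.normal(size=8)         # hypothetical learned attention query
weights, pooled = attention_pool(tokens, query)
print(weights.sum(), pooled.shape)
```

Tokens whose scores against the query are highest dominate the pooled vector, which is how emotionally salient words would be prioritized in a model of this shape.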

  • Article type: Journal Article
    Coreference resolution is a key task in natural language processing. It is difficult to evaluate the similarity of long-span texts, which makes text-level encoding somewhat challenging. This paper first compares how commonly used methods for improving a model's global information collection ability affect BERT encoding performance. Based on this, a multi-scale context information module is designed to improve the applicability of the BERT encoding model across different text spans. In addition, linear separability is improved through dimension expansion. Finally, cross-entropy loss is used as the loss function. After adding the module designed in this paper to BERT and SpanBERT, F1 increased by 0.5% and 0.2%, respectively.
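The claim that dimension expansion improves linear separability can be illustrated with a toy example (the paper's actual expansion is not specified here; this simply appends a squared term as an assumed, minimal form of expansion):

```python
import numpy as np

def expand(x):
    """Map 1-D inputs to 2-D by appending a squared term -- a simple
    dimension expansion that can make non-separable data separable."""
    return np.stack([x, x ** 2], axis=1)

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1, 0, 0, 1])          # outer points vs inner points
# In 1-D no single threshold on x separates the classes, but after
# expansion the second coordinate x**2 does: class 1 iff x**2 > 2.25.
feats = expand(x)
pred = (feats[:, 1] > 2.25).astype(int)
print(pred)
```

The same principle motivates widening the feature dimension before a linear scoring layer: a separating hyperplane may exist in the expanded space even when none exists in the original one.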

  • Article type: Journal Article
    With the rapid progress in Natural Language Processing (NLP), Pre-trained Language Models (PLMs) such as BERT, BioBERT, and ChatGPT have shown great potential in various medical NLP tasks. This paper surveys the cutting-edge achievements in applying PLMs to various medical NLP tasks. Specifically, we first briefly introduce PLMs and outline the research on PLMs in medicine. Next, we categorise and discuss the types of tasks in medical NLP, covering text summarisation, question-answering, machine translation, sentiment analysis, named entity recognition, information extraction, medical education, relation extraction, and text mining. For each type of task, we first provide an overview of the basic concepts, the main methodologies, the advantages of applying PLMs, the basic steps of applying PLMs, the datasets for training and testing, and the metrics for task evaluation. Subsequently, a summary of recent important research findings is presented, analysing their motivations, strengths and weaknesses, and similarities and differences, and discussing potential limitations. Also, we assess the quality and influence of the research reviewed in this paper by comparing the citation counts of the papers reviewed and the reputation and impact of the conferences and journals where they were published. Through these indicators, we further identify the research topics currently receiving the most attention. Finally, we look forward to future research directions, including enhancing models' reliability, explainability, and fairness, to promote the application of PLMs in clinical practice. In addition, this survey collects download links for some model code and relevant datasets, which are valuable references for researchers applying NLP techniques in medicine and for medical professionals seeking to enhance their expertise and healthcare services through AI technology.

  • Article type: Journal Article
    Knowledge graph completion aims to predict missing relations between entities in a knowledge graph. One of the effective approaches to knowledge graph completion is knowledge graph embedding. However, existing embedding methods usually focus on developing deeper and more complex neural networks, or on leveraging additional information, which inevitably increases computational complexity and is unfriendly to real-time applications. In this article, we propose an effective BERT-enhanced shallow neural network model for knowledge graph completion, named ShallowBKGC. Specifically, given an entity pair, we first apply the pre-trained language model BERT to extract text features of the head and tail entities. At the same time, we use an embedding layer to extract structure features of the head and tail entities. The text and structure features are then integrated into one entity-pair representation via an average operation followed by a non-linear transformation. Finally, based on the entity-pair representation, we calculate the probability of each relation through multi-label modeling to predict relations for the given entity pair. Experimental results on three benchmark datasets show that our model achieves superior performance in comparison with baseline methods. The source code for this article can be obtained from https://github.com/Joni-gogogo/ShallowBKGC.
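The scoring pipeline the abstract describes — average the text and structure features, apply a non-linear transformation, then score every relation independently with a sigmoid — can be sketched as follows. The weight shapes, tanh non-linearity, and random inputs are assumptions for illustration, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score_relations(text_feat, struct_feat, W, b, R):
    """Fuse the entity pair's text and structure features by averaging,
    apply a non-linear transformation, then score every relation
    independently (multi-label) with a sigmoid."""
    pair = (text_feat + struct_feat) / 2.0      # average fusion
    hidden = np.tanh(pair @ W + b)              # non-linear transform
    return sigmoid(hidden @ R)                  # one probability per relation

rng = np.random.default_rng(1)
dim, n_rel = 16, 10
text_feat = rng.normal(size=dim)     # stand-in for BERT text features
struct_feat = rng.normal(size=dim)   # stand-in for embedding-layer features
W, b = rng.normal(size=(dim, dim)), rng.normal(size=dim)
R = rng.normal(size=(dim, n_rel))    # hypothetical relation weight matrix
probs = score_relations(text_feat, struct_feat, W, b, R)
print(probs.shape)
```

Because each relation gets an independent sigmoid rather than a shared softmax, several relations can be predicted for the same entity pair, which is the point of the multi-label formulation.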

  • Article type: Journal Article
    Relation prediction is a critical task in knowledge graph completion and associated downstream tasks that rely on knowledge representation. Previous studies indicate that both structural features and semantic information are meaningful for predicting missing relations in knowledge graphs. This has led to the development of two types of methods: structure-based methods and semantics-based methods. Since these two approaches represent two distinct learning paradigms, it is difficult to fully utilize both sets of features within a single learning model, especially deep features. As a result, existing studies usually focus on only one type of feature. This leads to an insufficient representation of knowledge in current methods and makes them prone to overlooking certain patterns when predicting missing relations. In this study, we introduce a novel model, RP-ISS, which combines deep semantic and structural features for relation prediction. The RP-ISS model utilizes a two-part architecture, with the first component being a RoBERTa module that is responsible for extracting semantic features from entity nodes. The second part of the system employs an edge-based relational message-passing network designed to capture and interpret structural information within the data. To alleviate the computational burden of the message-passing network on the RoBERTa module during the sampling process, RP-ISS introduces a node embedding memory bank, which updates asynchronously to circumvent excessive computation. The model was assessed on three publicly accessible datasets (WN18RR, WN18, and FB15k-237), and the results revealed that RP-ISS surpasses all baseline methods across all evaluation metrics. Moreover, RP-ISS showcases robust performance in graph inductive learning.
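The node embedding memory bank described above — a cache that spares the RoBERTa module from re-encoding every sampled node — can be sketched as below. The exponential-moving-average refresh rule and the momentum value are assumptions; the paper says only that the bank updates asynchronously.

```python
import numpy as np

class EmbeddingMemoryBank:
    """Cache node embeddings so the message-passing network reads
    cached (asynchronously refreshed) vectors instead of re-running
    the language model for every sampled node."""

    def __init__(self, num_nodes, dim, momentum=0.9):
        self.bank = np.zeros((num_nodes, dim))
        self.momentum = momentum

    def read(self, node_ids):
        # cheap lookup during neighborhood sampling
        return self.bank[node_ids]

    def update(self, node_ids, fresh):
        # EMA refresh with freshly encoded vectors (assumed update rule)
        self.bank[node_ids] = (self.momentum * self.bank[node_ids]
                               + (1.0 - self.momentum) * fresh)

bank = EmbeddingMemoryBank(num_nodes=100, dim=4)
bank.update([0, 1], np.ones((2, 4)))   # nodes 0 and 1 were just re-encoded
print(bank.read([0])[0])               # partially refreshed embedding
```

Reads never block on the encoder, so the expensive RoBERTa forward pass can run on its own schedule while sampling proceeds against the cached vectors.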

  • Article type: Journal Article
    BACKGROUND: A promoter is a specific sequence in DNA with transcriptional regulatory functions, playing a role in initiating gene expression. Identifying promoters and their strengths can provide valuable information related to human diseases. In recent years, computational methods have gained prominence as an effective means of identifying promoters, offering a more efficient alternative to labor-intensive biological approaches.
    RESULTS: In this study, a two-stage integrated predictor called "msBERT-Promoter" is proposed for identifying promoters and predicting their strengths. The model incorporates multi-scale sequence information through a tokenization strategy and fine-tunes the DNABERT model. Soft voting is then used to fuse the multi-scale information, effectively addressing the issue of insufficient DNA sequence information extraction in traditional models. To the best of our knowledge, this is the first time an integrated approach has been used with the DNABERT model for promoter identification and strength prediction. Our model achieves accuracy rates of 96.2% for promoter identification and 79.8% for promoter strength prediction, significantly outperforming existing methods. Furthermore, through attention mechanism analysis, we demonstrate that our model can effectively combine local and global sequence information, enhancing its interpretability.
    CONCLUSIONS: msBERT-Promoter provides an effective tool that successfully captures sequence-related attributes of DNA promoters and can accurately identify promoters and predict their strengths. This work paves a new path for the application of artificial intelligence in traditional biology.
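Soft voting, the fusion step named in the abstract, simply averages the class-probability vectors produced by the scale-specific models and picks the class with the highest mean. A minimal sketch (the k-mer scales and probabilities below are invented for illustration):

```python
def soft_vote(prob_lists):
    """Average class-probability vectors from several models and
    return the index of the class with the highest mean probability."""
    n = len(prob_lists)
    n_classes = len(prob_lists[0])
    mean = [sum(p[i] for p in prob_lists) / n for i in range(n_classes)]
    return mean.index(max(mean)), mean

# Three hypothetical models tokenizing the sequence at different scales:
probs_3mer = [0.70, 0.30]   # [P(promoter), P(non-promoter)]
probs_4mer = [0.55, 0.45]
probs_6mer = [0.40, 0.60]
label, mean = soft_vote([probs_3mer, probs_4mer, probs_6mer])
print(label, mean)   # class 0 wins: mean P(promoter) = 0.55
```

Unlike hard (majority) voting, soft voting keeps each model's confidence, so a strongly confident model can outweigh two weakly confident ones.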

  • Article type: Journal Article
    Transcription factors (TFs) are proteins essential for regulating genetic transcriptions by binding to transcription factor binding sites (TFBSs) in DNA sequences. Accurate predictions of TFBSs can contribute to the design and construction of metabolic regulatory systems based on TFs. Although various deep-learning algorithms have been developed for predicting TFBSs, the prediction performance needs to be improved. This paper proposes a bidirectional encoder representations from transformers (BERT)-based model, called BERT-TFBS, to predict TFBSs solely based on DNA sequences. The model consists of a pre-trained BERT module (DNABERT-2), a convolutional neural network (CNN) module, a convolutional block attention module (CBAM) and an output module. The BERT-TFBS model utilizes the pre-trained DNABERT-2 module to acquire the complex long-term dependencies in DNA sequences through a transfer learning approach, and applies the CNN module and the CBAM to extract high-order local features. The proposed model is trained and tested based on 165 ENCODE ChIP-seq datasets. We conducted experiments with model variants, cross-cell-line validations and comparisons with other models. The experimental results demonstrate the effectiveness and generalization capability of BERT-TFBS in predicting TFBSs, and they show that the proposed model outperforms other deep-learning models. The source code for BERT-TFBS is available at https://github.com/ZX1998-12/BERT-TFBS.
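The CBAM component mentioned above includes a channel-attention step: squeeze the spatial dimension with average and max pooling, pass both summaries through a shared two-layer MLP, and gate the channels with a sigmoid. A sketch under stated assumptions (1-D feature map, ReLU MLP, reduction ratio 2; the real module's sizes differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, W1, W2):
    """CBAM-style channel attention over a (channels, length) map:
    average- and max-pool across length, run both through a shared
    MLP, add, and scale each channel by a sigmoid gate."""
    avg = feat.mean(axis=1)                          # (C,)
    mx = feat.max(axis=1)                            # (C,)
    mlp = lambda v: np.maximum(v @ W1, 0.0) @ W2     # shared ReLU MLP
    gate = sigmoid(mlp(avg) + mlp(mx))               # per-channel weight in (0, 1)
    return feat * gate[:, None]

rng = np.random.default_rng(2)
C, L = 8, 32
feat = rng.normal(size=(C, L))      # stand-in for CNN-module features
W1 = rng.normal(size=(C, C // 2))   # reduction ratio 2 (assumption)
W2 = rng.normal(size=(C // 2, C))
out = channel_attention(feat, W1, W2)
print(out.shape)
```

Because every gate lies in (0, 1), the module can only damp uninformative channels, never amplify them, which keeps the reweighting stable.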

  • Article type: Journal Article
    Cancer, a significant global public health issue, resulted in about 10 million deaths in 2022. Anticancer peptides (ACPs), a category of bioactive peptides, have emerged as a focal point in clinical cancer research due to their potential to inhibit tumor cell proliferation with minimal side effects. However, the recognition of ACPs through wet-lab experiments still faces the challenges of low efficiency and high cost. Our work proposes a recognition method for ACPs, named ACP-DRL, based on deep representation learning, to address these challenges. ACP-DRL marks an initial exploration of integrating protein language models into ACP recognition, employing further in-domain pre-training to strengthen the learned representations. Simultaneously, it employs bidirectional long short-term memory networks to extract amino acid features from sequences. Consequently, ACP-DRL eliminates constraints on sequence length and the dependence on manual features, showing remarkable competitiveness in comparison with existing methods.

  • Article type: Journal Article
    The burgeoning issue of plasmid-mediated resistance gene (ARG) dissemination poses a significant threat to environmental integrity. However, the prediction of ARG prevalence is overlooked, especially for emerging ARGs that are potentially evolving in gene-exchange hotspots. Here, we explored classifying plasmid or chromosome sequences and detecting resistance-gene prevalence using DNABERT. Initially, DNABERT fine-tuned on plasmid and chromosome sequences, followed by a multilayer perceptron (MLP) classifier, achieved an AUC (area under the curve) of 0.764 on external datasets across 23 genera, outperforming a traditional statistics-based model by 0.02 AUC. Furthermore, single-genus models for Escherichia and Pseudomonas were also trained to explore their predictive performance for ARG prevalence detection. By integrating k-mer frequency attributes, our model boosted performance on an external dataset by 0.0281-0.0615 AUC for Escherichia and 0.0196-0.0928 AUC for Pseudomonas. Finally, drawing on data from the existing literature, we established a random forest model aimed at forecasting the relative conjugation transfer rate of plasmids with 0.7956 AUC. It identifies the plasmid's repression status, cellular density, and temperature as the most important factors influencing transfer frequency. Combined, these two models provide a useful reference for quick, low-cost integrated evaluation of resistance-gene transfer, accelerating computer-assisted quantitative risk assessment of ARG transfer in the environmental field.
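The k-mer frequency attributes used to boost the single-genus models can be computed by sliding a window over the sequence and normalizing the counts. A minimal sketch (k=3 here is just an example; the paper's exact k values and normalization are not specified):

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k=3):
    """Slide a width-k window over the sequence and return the
    normalized frequency of every possible A/C/G/T k-mer."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] / total for kmer in vocab]

freqs = kmer_frequencies("ACGTACGT", k=3)
print(len(freqs), sum(freqs))   # 4**3 = 64 features that sum to 1.0
```

The resulting fixed-length vector can be concatenated with the learned DNABERT features before the downstream classifier, which is one common way such hand-computed attributes are integrated.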
