Pre-trained language model

  • Article Type: Journal Article
    Understanding the genetic basis of disease is a fundamental aspect of medical research, as genes are the classic units of heredity and play a crucial role in biological function. Identifying associations between genes and diseases is critical for diagnosis, prevention, prognosis, and drug development. Genes that encode proteins with similar sequences are often implicated in related diseases, as proteins causing identical or similar diseases tend to show limited variation in their sequences. Predicting gene-disease associations (GDAs) otherwise requires time-consuming and expensive experiments on a large number of candidate genes. Although methods using traditional machine learning algorithms and graph neural networks have been proposed to predict associations between genes and diseases, these approaches struggle to capture deep semantic information about genes and diseases and depend heavily on training data. To alleviate this issue, we propose a novel GDA prediction model named FusionGDA, which uses a pre-training phase with a fusion module to enrich the gene and disease semantic representations encoded by pre-trained language models. The fusion module generates multi-modal representations that encode rich semantic information about two heterogeneous biomedical entities: protein sequences and disease descriptions. A pooling aggregation strategy is then adopted to compress the dimensions of the multi-modal representation. In addition, the pre-training phase leverages a contrastive learning loss to extract latent gene and disease features by training on a large public GDA dataset. To rigorously evaluate the effectiveness of the FusionGDA model, we conduct comprehensive experiments on five datasets and compare our proposed model with five competitive baseline models on the DisGeNet-Eval dataset. Notably, our case study further demonstrates the ability of FusionGDA to discover hidden associations effectively. The complete code and datasets of our experiments are available at https://github.com/ZhaohanM/FusionGDA.
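
    As an illustration of the core recipe described above (PLM-encoded modalities, a fusion module, pooling aggregation, and a contrastive pre-training loss), here is a minimal PyTorch sketch. The dimensions, module names, and the InfoNCE formulation are assumptions for exposition, not FusionGDA's actual implementation; random tensors stand in for the PLM token embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Cross-attention fusion of two PLM token sequences (illustrative, not the paper's code)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gene_proj = nn.Linear(dim, dim)
        self.dis_proj = nn.Linear(dim, dim)

    def forward(self, gene_tokens, disease_tokens):
        # Disease-description tokens attend to protein-sequence tokens,
        # producing a multi-modal token sequence.
        fused, _ = self.attn(disease_tokens, gene_tokens, gene_tokens)
        # Pooling aggregation: compress the token dimension by mean pooling.
        gene_vec = F.normalize(self.gene_proj(gene_tokens.mean(dim=1)), dim=-1)
        dis_vec = F.normalize(self.dis_proj(fused.mean(dim=1)), dim=-1)
        return gene_vec, dis_vec

def contrastive_loss(gene_vec, dis_vec, temperature=0.07):
    """InfoNCE-style loss: matched gene/disease pairs are positives, all others negatives."""
    logits = gene_vec @ dis_vec.t() / temperature
    labels = torch.arange(logits.size(0))
    return F.cross_entropy(logits, labels)

# Stand-ins for token embeddings from a protein LM and a disease-text LM.
gene_tokens = torch.randn(8, 120, 256)    # (batch, residues, dim)
disease_tokens = torch.randn(8, 40, 256)  # (batch, description tokens, dim)
gene_vec, dis_vec = FusionModule()(gene_tokens, disease_tokens)
print(contrastive_loss(gene_vec, dis_vec))
```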

  • Article Type: Journal Article
    BACKGROUND: The urgency and complexity of emergency room (ER) settings require precise and swift decision-making processes for patient care. Ensuring the timely execution of critical examinations and interventions is vital for reducing diagnostic errors, but the literature highlights a need for innovative approaches to optimize diagnostic accuracy and patient outcomes. In response, our study endeavors to create predictive models for timely examinations and interventions by leveraging the patient's symptoms and vital signs recorded during triage, and in so doing, augment traditional diagnostic methodologies.
    METHODS: Focusing on four key areas (medication dispensing, vital interventions, laboratory testing, and emergency radiology exams), the study employed Natural Language Processing (NLP) and seven advanced machine learning techniques. The research was centered around the innovative use of BioClinicalBERT, a state-of-the-art NLP framework.
    RESULTS: BioClinicalBERT emerged as the superior model, outperforming others in predictive accuracy. The integration of physiological data with patient narrative symptoms demonstrated greater effectiveness compared to models based solely on textual data. The robustness of our approach was confirmed by an Area Under the Receiver Operating Characteristic curve (AUROC) score of 0.9.
    CONCLUSIONS: The findings of our study underscore the feasibility of establishing a decision support system for emergency patients, targeting timely interventions and examinations based on a nuanced analysis of symptoms. By using an advanced natural language processing technique, our approach shows promise for enhancing diagnostic accuracy. However, the current model is not yet fully mature for direct implementation into daily clinical practice. Recognizing the imperative nature of precision in the ER environment, future research endeavors must focus on refining and expanding predictive models to include detailed timely examinations and interventions. Although the progress achieved in this study represents an encouraging step towards a more innovative and technology-driven paradigm in emergency care, full clinical integration warrants further exploration and validation.
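
    For context, fine-tuning BioClinicalBERT for this kind of multi-label triage prediction typically follows the standard Hugging Face recipe sketched below. The checkpoint name is the public emilyalsentzer/Bio_ClinicalBERT release, and the example triage text and four-label setup are hypothetical; the classification head is freshly initialized and would need fine-tuning on labeled ED data before the probabilities mean anything.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "emilyalsentzer/Bio_ClinicalBERT"  # public BioClinicalBERT release
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT,
    num_labels=4,  # medication dispensing, vital interventions, lab tests, radiology
    problem_type="multi_label_classification",  # sigmoid + BCE during fine-tuning
)

# Hypothetical triage record: narrative symptoms plus vital signs as text.
text = ("chest pain radiating to the left arm for two hours; "
        "HR 112, BP 90/60, SpO2 93%, temperature 37.1 C")
inputs = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)
print(probs)  # untrained head: values are meaningless until fine-tuned on labeled ED data
```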

  • Article Type: Journal Article
    OBJECTIVE: Active learning (AL) methods have rarely integrated diversity-based and uncertainty-based strategies into a dynamic sampling framework for clinical named entity recognition (NER). Machine-assisted annotation is becoming popular for creating gold-standard labels. This study investigated the effectiveness of dynamic AL strategies under simulated machine-assisted annotation scenarios for clinical NER.
    METHODS: We proposed 3 new AL strategies: a diversity-based strategy (CLUSTER) based on Sentence-BERT and 2 dynamic strategies (CLC and CNBSE) capable of switching from diversity-based to uncertainty-based strategies. Using BioClinicalBERT as the foundational NER model, we conducted simulation experiments on 3 medication-related clinical NER datasets independently: i2b2 2009, n2c2 2018 (Track 2), and MADE 1.0. We compared the proposed strategies with uncertainty-based (LC and NBSE) and passive-learning (RANDOM) strategies. Performance was primarily measured by the number of edits made by the annotators to achieve a desired target effectiveness evaluated on independent test sets.
    RESULTS: When aiming for 98% overall target effectiveness, on average, CLUSTER required the fewest edits. When aiming for 99% overall target effectiveness, CNBSE required 20.4% fewer edits than NBSE did. CLUSTER and RANDOM could not achieve such a high target under the pool-based simulation experiment. For high-difficulty entities, CNBSE required 22.5% fewer edits than NBSE to achieve 99% target effectiveness, whereas neither CLUSTER nor RANDOM achieved 93% target effectiveness.
    CONCLUSIONS: When the target effectiveness was set high, the proposed dynamic strategy CNBSE exhibited both strong learning capabilities and low annotation costs in machine-assisted annotation. CLUSTER required the fewest edits when the target effectiveness was set low.
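
    A generic sketch of the diversity-to-uncertainty switch behind such dynamic strategies is shown below, using Sentence-BERT embeddings with k-means for the diversity step and least-confidence scores for the uncertainty step. The switch rule and all names here are illustrative simplifications; the paper's CLC/CNBSE switching criteria are more involved.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def diversity_select(pool_texts, k):
    """CLUSTER-style step: embed the unlabeled pool with Sentence-BERT and
    pick the sentence closest to each k-means centroid."""
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(pool_texts)
    centers = KMeans(n_clusters=k, n_init=10).fit(emb).cluster_centers_
    return [int(np.argmin(np.linalg.norm(emb - c, axis=1))) for c in centers]

def least_confidence_select(token_probs, k):
    """LC-style step: rank sentences by mean max token probability, lowest first."""
    conf = np.array([p.max(axis=1).mean() for p in token_probs])
    return list(np.argsort(conf)[:k])

def dynamic_select(pool_texts, token_probs, k, switch_to_uncertainty):
    """Hypothetical dynamic policy: diversity sampling while the NER model is cold,
    uncertainty sampling once it is trained enough to yield usable confidences."""
    if switch_to_uncertainty:
        return least_confidence_select(token_probs, k)
    return diversity_select(pool_texts, k)
```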

  • Article Type: Journal Article
    In the development of the Power Industry Internet of Things, the security of data interaction has always been an important challenge. In the power-based blockchain Industrial Internet of Things, node data interaction involves a large amount of sensitive data. In current anti-leakage strategies for power business data interaction, regular expressions are used to identify sensitive data for matching. This approach is only suitable for simple structured data; for unstructured data, practical matching strategies are lacking. Therefore, this paper proposes a deep learning-based anti-leakage method for power business data interaction, aiming to ensure the security of power business data interaction between the State Grid business platform and third-party platforms. The method combines named entity recognition techniques, using regular expressions together with a DeBERTa (Decoding-enhanced BERT with disentangled attention)-BiLSTM (Bidirectional Long Short-Term Memory)-CRF (Conditional Random Field) model. The DeBERTa model performs pre-trained feature extraction, the BiLSTM extracts sequence-level contextual semantic features, and the CRF layer finally produces the globally optimal tag sequence. Sensitive data matching is performed on interactive structured and unstructured data to identify privacy-sensitive information in the power business. Experimental results show that the proposed method achieves an F1 score of 81.26% for identifying sensitive data entities on the CLUENER 2020 dataset, effectively preventing the risk of power business data leakage and providing an innovative solution for ensuring data security in the power industry.
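
    The described architecture maps naturally onto a short PyTorch module: a DeBERTa encoder for feature extraction, a BiLSTM for sequence context, and a CRF layer for globally optimal tag decoding. This is a hedged sketch assuming the microsoft/deberta-base checkpoint and the pytorch-crf package, not the authors' code.

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf

class DebertaBiLstmCrf(nn.Module):
    def __init__(self, num_tags, hidden=256, plm="microsoft/deberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm)        # pre-trained feature extraction
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)          # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)           # structured tag decoding

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.bilstm(h)                                # sequence context features
        emissions = self.emit(h)
        mask = attention_mask.bool()
        if tags is not None:                                 # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)         # inference: Viterbi decoding
```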

  • Article Type: Journal Article
    With the rapid progress in Natural Language Processing (NLP), Pre-trained Language Models (PLMs) such as BERT, BioBERT, and ChatGPT have shown great potential in various medical NLP tasks. This paper surveys the cutting-edge achievements in applying PLMs to various medical NLP tasks. Specifically, we first briefly introduce PLMs and outline the research on PLMs in medicine. Next, we categorise and discuss the types of tasks in medical NLP, covering text summarisation, question-answering, machine translation, sentiment analysis, named entity recognition, information extraction, medical education, relation extraction, and text mining. For each type of task, we first provide an overview of the basic concepts, the main methodologies, the advantages of applying PLMs, the basic steps of applying PLMs, the datasets for training and testing, and the metrics for task evaluation. Subsequently, a summary of recent important research findings is presented, analysing their motivations, strengths and weaknesses, similarities and differences, and discussing potential limitations. We also assess the quality and influence of the research reviewed by comparing the citation counts of the papers and the reputation and impact of the conferences and journals in which they were published; through these indicators, we further identify the research topics currently attracting the most attention. Finally, we look forward to future research directions, including enhancing models' reliability, explainability, and fairness, to promote the application of PLMs in clinical practice. In addition, this survey collects download links for model code and the relevant datasets, which are valuable references for researchers applying NLP techniques in medicine and for medical professionals seeking to enhance their expertise and healthcare services through AI technology.
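
    The "basic steps of applying PLMs" that the survey catalogues usually reduce to: pick a domain checkpoint, feed it task text, and read out predictions. A minimal illustration with a clinical masked-language-model probe (the checkpoint choice and sentence are arbitrary examples, not from the survey):

```python
from transformers import pipeline

# Bio_ClinicalBERT is a masked LM, so we probe it with a cloze-style clinical sentence.
unmasker = pipeline("fill-mask", model="emilyalsentzer/Bio_ClinicalBERT")
for cand in unmasker("The patient was started on [MASK] for hypertension.", top_k=3):
    print(cand["token_str"], round(cand["score"], 3))
```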

  • Article Type: Journal Article
    The increasing prevalence of overcrowding in Emergency Departments (EDs) threatens the effective delivery of urgent healthcare. Mitigation strategies include the deployment of monitoring systems capable of tracking and managing patient disposition to facilitate appropriate and timely care, which subsequently reduces patient revisits, optimizes resource allocation, and enhances patient outcomes. This study used ∼250,000 emergency department visit records from Taipei Medical University-Shuang Ho Hospital to develop a natural language processing model using BlueBERT, a biomedical domain-specific pre-trained language model, to predict patient disposition status and unplanned readmissions. Data preprocessing and the integration of both structured and unstructured data were central to our approach. BlueBERT outperformed other models due to its pre-training on a diverse range of medical literature, enabling it to better comprehend the specialized terminology, relationships, and context present in ED data. We found that translating mixed Chinese-English clinical narratives into English and textualizing numerical data into categorical representations significantly improved the prediction of patient disposition (AUROC = 0.9014) and 72-hour unscheduled return visits (AUROC = 0.6475). The study concludes that the BlueBERT-based model demonstrated superior prediction capabilities, surpassing the performance of prior patient disposition predictive models, thus offering promising applications in the realm of ED clinical practice.
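
    The textualization step that drove the improvement can be sketched in a few lines: numeric vitals are binned into categorical phrases and appended to the translated narrative before tokenization. The thresholds below are illustrative clinical conventions, not the study's actual binning scheme.

```python
def textualize_vitals(hr, sbp, temp_c):
    """Illustrative binning: numeric vitals -> categorical tokens (thresholds are assumptions)."""
    bins = []
    bins.append("tachycardic" if hr > 100 else "normal heart rate")
    bins.append("hypotensive" if sbp < 90 else "normal blood pressure")
    bins.append("febrile" if temp_c >= 38.0 else "afebrile")
    return ", ".join(bins)

narrative = "sudden onset dizziness and palpitations"  # already translated to English
record = f"{narrative}. vitals: {textualize_vitals(hr=118, sbp=85, temp_c=38.4)}"
print(record)
# -> "... vitals: tachycardic, hypotensive, febrile", fed to BlueBERT as plain text
```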

  • Article Type: Journal Article
    Pre-trained Language Models (PLMs) are nowadays the mainstay of Unsupervised Sentence Representation Learning (USRL). However, PLMs are sensitive to the frequency information of words in their pre-training corpora, resulting in an anisotropic embedding space in which the embeddings of high-frequency words are clustered while those of low-frequency words are sparsely dispersed. This anisotropic phenomenon causes two problems, similarity bias and information bias, which lower the quality of sentence embeddings. To solve these problems, we fine-tune PLMs by leveraging the frequency information of words and propose a novel USRL framework, namely Sentence Representation Learning with Frequency-induced Adversarial tuning and Incomplete sentence filtering (Slt-fai). We calculate word frequencies over the pre-training corpora of PLMs and assign each word a threshold-based frequency label. With these labels, (1) we incorporate a similarity discriminator that distinguishes the embeddings of high-frequency and low-frequency words and adversarially tune the PLM with it, yielding a uniformly frequency-invariant embedding space; and (2) we propose a novel incomplete-sentence detection task, in which an information discriminator distinguishes the embeddings of original sentences from those of incomplete sentences created by randomly masking several low-frequency words, emphasizing the more informative low-frequency words. Slt-fai is a flexible, plug-and-play framework that can be integrated with existing USRL techniques. We evaluate Slt-fai with various backbones on benchmark datasets. Empirical results indicate that Slt-fai can be superior to existing USRL baselines.
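
    The frequency-labeling and adversarial-tuning ingredients can be sketched as follows: words are labeled high- or low-frequency by a corpus count threshold, and a discriminator trained through a gradient-reversal layer pushes the encoder toward a frequency-invariant space. All sizes and the threshold are assumptions; the paper's full objectives also include the incomplete-sentence filtering task, omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import Counter

def frequency_labels(corpus_tokens, threshold=100):
    """Assign each word a binary frequency label: 1 = high-frequency, 0 = low-frequency."""
    counts = Counter(corpus_tokens)
    return {w: int(c >= threshold) for w, c in counts.items()}

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad

discriminator = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

def adversarial_loss(word_embeddings, freq_labels):
    """The discriminator learns to separate high/low-frequency embeddings; the
    reversed gradient pushes the encoder toward frequency-invariant embeddings."""
    logits = discriminator(GradReverse.apply(word_embeddings))
    return F.cross_entropy(logits, freq_labels)

emb = torch.randn(32, 768, requires_grad=True)  # stand-in for PLM word embeddings
labels = torch.randint(0, 2, (32,))             # threshold-based frequency labels
adversarial_loss(emb, labels).backward()
```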

  • Article Type: Journal Article
    Accurate classification of membrane proteins like ion channels and transporters is critical for elucidating cellular processes and drug development. We present DeepPLM_mCNN, a novel framework combining Pretrained Language Models (PLMs) and multi-window convolutional neural networks (mCNNs) for effective classification of membrane proteins into ion channels and ion transporters. Our approach extracts informative features from protein sequences by utilizing various PLMs, including TAPE, ProtT5_XL_U50, ESM-1b, ESM-2_480, and ESM-2_1280. These PLM-derived features are then input into an mCNN architecture to learn conserved motifs important for classification. When evaluated on ion transporters, our best-performing model, utilizing ProtT5, achieved 90% sensitivity, 95.8% specificity, and 95.4% overall accuracy. For ion channels, we obtained 88.3% sensitivity, 95.7% specificity, and 95.2% overall accuracy using ESM-1b features. Our proposed DeepPLM_mCNN framework demonstrates significant improvements over previous methods on unseen test data. This study illustrates the potential of combining PLMs and deep learning for accurate computational identification of membrane proteins from sequence data alone. Our findings have important implications for membrane protein research and drug development targeting ion channels and transporters. The data and source code in this study are publicly available at the following link: https://github.com/s1129108/DeepPLM_mCNN.
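
    The mCNN component is essentially a set of parallel 1-D convolutions with different window sizes over the PLM's per-residue embeddings, max-pooled and concatenated before classification. A minimal sketch follows (dimensions chosen to match ESM-2_1280-style embeddings; filter counts and window sizes are assumptions):

```python
import torch
import torch.nn as nn

class MultiWindowCNN(nn.Module):
    """Parallel convolutions with several window sizes capture motifs of different lengths."""
    def __init__(self, emb_dim=1280, windows=(3, 5, 7), filters=128, n_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, filters, kernel_size=w) for w in windows)
        self.fc = nn.Linear(filters * len(windows), n_classes)

    def forward(self, x):                    # x: (batch, seq_len, emb_dim) from a PLM
        x = x.transpose(1, 2)                # Conv1d expects (batch, channels, seq_len)
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))

# Stand-in for ESM-2_1280 per-residue embeddings of a 300-residue protein.
logits = MultiWindowCNN()(torch.randn(4, 300, 1280))
print(logits.shape)  # torch.Size([4, 2]): ion channel vs. ion transporter
```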

  • Article Type: Journal Article
    BACKGROUND: Cell-type annotation of single-cell RNA-sequencing (scRNA-seq) data is a hallmark of biomedical research and clinical application. Current annotation tools usually assume the simultaneous acquisition of well-annotated data, without the ability to expand knowledge from new data. Such tools are inconsistent with the continuous emergence of scRNA-seq data, calling for a continual cell-type annotation model. In addition, owing to their powerful information-integration ability and model interpretability, transformer-based pre-trained language models have led to breakthroughs in single-cell biology research. Systematically combining continual learning with pre-trained language models for cell-type annotation tasks is therefore inevitable.
    RESULTS: We herein propose a universal cell-type annotation tool, called CANAL, that continuously fine-tunes a pre-trained language model trained on a large amount of unlabeled scRNA-seq data, as new well-labeled data emerges. CANAL essentially alleviates the dilemma of catastrophic forgetting, both in terms of model inputs and outputs. For model inputs, we introduce an experience replay schema that repeatedly reviews previous vital examples in current training stages. This is achieved through a dynamic example bank with a fixed buffer size. The example bank is class-balanced and proficient in retaining cell-type-specific information, particularly facilitating the consolidation of patterns associated with rare cell types. For model outputs, we utilize representation knowledge distillation to regularize the divergence between previous and current models, resulting in the preservation of knowledge learned from past training stages. Moreover, our universal annotation framework considers the inclusion of new cell types throughout the fine-tuning and testing stages. We can continuously expand the cell-type annotation library by absorbing new cell types from newly arrived, well-annotated training datasets, as well as automatically identify novel cells in unlabeled datasets. Comprehensive experiments with data streams under various biological scenarios demonstrate the versatility and high model interpretability of CANAL.
    AVAILABILITY: An implementation of CANAL is available from https://github.com/aster-ww/CANAL-torch.
    CONTACT: dengmh@pku.edu.cn.
    SUPPLEMENTARY INFORMATION: Supplementary data are available at Journal Name online.

  • Article Type: Journal Article
    The drug discovery process is demanding and time-consuming, and machine learning-based research is increasingly proposed to enhance efficiency. A significant challenge in this field is predicting whether a drug molecule's structure will interact with a target protein. A recent study attempted to address this challenge by utilizing an encoder that leverages prior knowledge of molecular and protein structures, resulting in notable improvements in the prediction performance of the drug-target interaction task. Nonetheless, the target encoders employed in previous studies exhibit computational complexity that increases quadratically with the input length, thereby limiting their practical utility. To overcome this challenge, we adopt a hint-based learning strategy to develop a compact and efficient target encoder. With an adaptation parameter, our model can blend general knowledge and target-oriented knowledge to build features of the protein sequences. This approach yielded considerable performance enhancements and improved learning efficiency on three benchmark datasets: BIOSNAP, DAVIS, and BindingDB. Furthermore, our methodology requires only a minimal Video RAM (VRAM) allocation, specifically 7.7 GB, during the training phase (16.24% of that required by the previous state-of-the-art model). This ensures the feasibility of training and inference even with constrained computational resources.
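
    The adaptation-parameter idea, blending frozen general-knowledge features with a small target-oriented head, can be sketched as below. The sigmoid-gated scalar blend and all names are assumptions for exposition; the paper's hint-based formulation may differ.

```python
import torch
import torch.nn as nn

class BlendedTargetEncoder(nn.Module):
    """Blend frozen general-knowledge features with a light target-oriented head
    via a learnable adaptation parameter alpha (illustrative sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.target_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.alpha = nn.Parameter(torch.tensor(0.5))  # adaptation parameter

    def forward(self, general_feats):                 # e.g., frozen protein-LM features
        a = torch.sigmoid(self.alpha)                 # keep the mixing weight in (0, 1)
        return a * general_feats + (1 - a) * self.target_head(general_feats)

feats = torch.randn(2, 512)                           # stand-in for protein-LM output
print(BlendedTargetEncoder()(feats).shape)            # torch.Size([2, 512])
```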