biomedical NLP

  • Article type: Journal Article
    OBJECTIVE: Recently, large language models (LLMs) have showcased remarkable capabilities in natural language understanding. While demonstrating proficiency in everyday conversations and question-answering (QA) situations, these models frequently struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge. In this article, we describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
    METHODS: We adapt a general-purpose LLM toward the medical domain, involving data-centric knowledge injection through the integration of 4.8M biomedical academic papers and 30K medical textbooks, as well as comprehensive domain-specific instruction fine-tuning, encompassing medical QA, rationale for reasoning, and conversational dialogues with 202M tokens.
    RESULTS: In evaluations on various public medical QA benchmarks and in manual rating, our lightweight PMC-LLaMA, which consists of only 13B parameters, exhibits superior performance, even surpassing ChatGPT. All models, codes, and datasets for instruction tuning will be released to the research community.
    CONCLUSIONS: In this article, we systematically investigate the process of building up an open-source medical-specific LLM, PMC-LLaMA. Our contributions are 3-fold: (1) we build up an open-source LLM toward the medical domain. We believe the proposed PMC-LLaMA model can promote further development of foundation models in medicine, serving as a medical trainable basic generative language backbone; (2) we conduct thorough ablation studies to demonstrate the effectiveness of each proposed component, demonstrating how different training data and model scales affect medical LLMs; (3) we contribute a large-scale, comprehensive dataset for instruction tuning.
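The METHODS above mention instruction fine-tuning over medical QA pairs with reasoning rationales. A minimal sketch of how one such (question, rationale, answer) triple might be rendered into a single training string is shown below; the template, field names, and example content are illustrative assumptions, not PMC-LLaMA's actual data format.

```python
# Hypothetical sketch: formatting a medical QA pair into an
# instruction-tuning example for supervised fine-tuning of a causal LM.
# The template is illustrative, not the paper's actual format.

def format_instruction(question: str, rationale: str, answer: str) -> str:
    """Render one (question, rationale, answer) triple as a single
    training string for supervised instruction fine-tuning."""
    return (
        "### Instruction:\nAnswer the following medical question, "
        "explaining your reasoning.\n\n"
        f"### Question:\n{question}\n\n"
        f"### Response:\n{rationale}\nAnswer: {answer}"
    )

example = format_instruction(
    question="Which vitamin deficiency causes scurvy?",
    rationale="Scurvy results from impaired collagen synthesis, "
              "which requires ascorbic acid as a cofactor.",
    answer="Vitamin C",
)
print(example)
```

Each rendered string would then be tokenized and used as one supervised training example; the 202M-token corpus described above is, in effect, a large collection of such strings.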

  • Article type: Journal Article
    Distributed semantic representation of biomedical text can be beneficial for text classification, named entity recognition, query expansion, human comprehension, and information retrieval. Despite the success of high-quality vector space models such as Word2Vec and GloVe, they only provide unigram word representations and the semantics for multi-word phrases can only be approximated by composition. This is problematic in biomedical text processing where technical phrases for diseases, symptoms, and drugs should be represented as single entities to capture the correct meaning. In this paper, we introduce PMCVec, an unsupervised technique that generates important phrases from PubMed abstracts and learns embeddings for single words and multi-word phrases simultaneously. Evaluations performed on benchmark datasets produce significant performance gains both qualitatively and quantitatively.
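The key step described above is identifying multi-word phrases so that terms like "heart failure" receive a single embedding. A minimal sketch of one common unsupervised approach is shown below: score adjacent word pairs by pointwise mutual information (PMI) and merge high-scoring bigrams into single tokens. The scoring function, thresholds, and toy corpus are assumptions for illustration; PMCVec's actual phrase-generation method may differ.

```python
# Hypothetical sketch: PMI-based bigram merging, so that frequent
# collocations ("heart failure") become single tokens before embedding
# training. Thresholds and corpus are illustrative.
import math
from collections import Counter

def merge_phrases(sentences, min_pmi=2.0, min_count=2):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    total = sum(unigrams.values())

    def pmi(w1, w2):
        p_joint = bigrams[(w1, w2)] / total
        p_indep = (unigrams[w1] / total) * (unigrams[w2] / total)
        return math.log2(p_joint / p_indep)

    # Keep bigrams that are both frequent and strongly associated.
    keep = {b for b, c in bigrams.items()
            if c >= min_count and pmi(*b) >= min_pmi}

    # Rewrite each sentence, joining kept bigrams into single tokens.
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in keep:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged

corpus = [["chronic", "heart", "failure", "was", "observed"],
          ["heart", "failure", "with", "reduced", "ejection", "fraction"],
          ["the", "patient", "was", "observed"]]
print(merge_phrases(corpus))
```

The merged corpus can then be fed to any word-embedding trainer, which learns vectors for single words and merged phrases simultaneously.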

  • Article type: Journal Article
    The emergence of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) late last year has not only led to the world-wide coronavirus disease 2019 (COVID-19) pandemic but also a deluge of biomedical literature. Following the release of the COVID-19 open research dataset (CORD-19) comprising over 200,000 scholarly articles, we, a multi-disciplinary team of data scientists, clinicians, medical researchers, and software engineers, developed an innovative natural language processing (NLP) platform that combines an advanced search engine with a biomedical named entity recognition extraction package. In particular, the platform was developed to extract information relating to clinical risk factors for COVID-19 by presenting the results in a cluster format to support knowledge discovery. Here we describe the principles behind the development, the model, and the results we obtained.
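The "cluster format" described above amounts to grouping mentions extracted by the NER package so that related risk factors appear together. A minimal sketch of that grouping step is shown below; the entity labels and mentions are made-up stand-ins, not output from the actual platform.

```python
# Hypothetical sketch: group (surface_text, entity_type) pairs emitted by
# a biomedical NER package by entity type, deduplicating case variants,
# so all mentions of one kind appear together for knowledge discovery.
from collections import defaultdict

def cluster_entities(mentions):
    """mentions: iterable of (surface_text, entity_type) pairs."""
    clusters = defaultdict(set)
    for text, etype in mentions:
        clusters[etype].add(text.lower())
    return {etype: sorted(texts) for etype, texts in clusters.items()}

ner_output = [
    ("Hypertension", "RISK_FACTOR"),
    ("diabetes", "RISK_FACTOR"),
    ("SARS-CoV-2", "PATHOGEN"),
    ("hypertension", "RISK_FACTOR"),
]
print(cluster_entities(ner_output))
```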

  • Article type: Journal Article
    BACKGROUND: Semantic textual similarity (STS) is a natural language processing (NLP) task that involves assigning a similarity score to 2 snippets of text based on their meaning. This task is particularly difficult in the domain of clinical text, which often features specialized language and the frequent use of abbreviations.
    OBJECTIVE: We created an NLP system to predict similarity scores for sentence pairs as part of the Clinical Semantic Textual Similarity track in the 2019 n2c2/OHNLP Shared Task on Challenges in Natural Language Processing for Clinical Data. We subsequently sought to analyze the intermediary token vectors extracted from our models while processing a pair of clinical sentences to identify where and how representations of semantic similarity are built in transformer models.
    METHODS: Given a clinical sentence pair, we take the average predicted similarity score across several independently fine-tuned transformers. In our model analysis we investigated the relationship between the final model's loss and surface features of the sentence pairs and assessed the decodability and representational similarity of the token vectors generated by each model.
    RESULTS: Our model achieved a correlation of 0.87 with the ground-truth similarity score, reaching 6th place out of 33 teams (with a first-place score of 0.90). In detailed qualitative and quantitative analyses of the model's loss, we identified the system's failure to correctly model semantic similarity when both sentence pairs contain details of medical prescriptions, as well as its general tendency to overpredict semantic similarity given significant token overlap. The token vector analysis revealed divergent representational strategies for predicting textual similarity between bidirectional encoder representations from transformers (BERT)-style models and XLNet. We also found that a large amount of information relevant to predicting STS can be captured using a combination of a classification token and the cosine distance between sentence-pair representations in the first layer of a transformer model that did not produce the best predictions on the test set.
    CONCLUSIONS: We designed and trained a system that uses state-of-the-art NLP models to achieve very competitive results on a new clinical STS data set. As our approach uses no hand-crafted rules, it serves as a strong deep learning baseline for this task. Our key contribution is a detailed analysis of the model's outputs and an investigation of the heuristic biases learned by transformer models. We suggest future improvements based on these findings. In our representational analysis we explore how different transformer models converge or diverge in their representation of semantic signals as the tokens of the sentences are augmented by successive layers. This analysis sheds light on how these "black box" models integrate semantic similarity information in intermediate layers, and points to new research directions in model distillation and sentence embedding extraction for applications in clinical NLP.
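The METHODS and RESULTS above rely on two simple computations: ensembling by averaging per-model similarity predictions, and cosine similarity between sentence representations. A minimal sketch of both is shown below; the scores and vectors are made-up stand-ins for real transformer outputs.

```python
# Hypothetical sketch of the two core computations in the abstract:
# (1) ensembling by averaging each fine-tuned model's predicted score,
# (2) cosine similarity between sentence vectors, as used in the
#     token-vector analysis. Inputs are illustrative toy values.
import math

def ensemble_score(per_model_scores):
    """Average the similarity score predicted by each fine-tuned model."""
    return sum(per_model_scores) / len(per_model_scores)

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Three hypothetical fine-tuned transformers scoring one sentence pair
# on the 0-5 clinical STS scale; the ensemble averages to roughly 4.1:
print(ensemble_score([3.8, 4.1, 4.4]))

# Parallel toy embeddings give a cosine similarity close to 1.0:
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Averaging independently fine-tuned models is a standard variance-reduction step; cosine distance is simply one minus the cosine similarity computed here.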
