Medical question answering

  • Article type: Journal Article
    Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology to assist medical experts in interactive decision support. This potential has been illustrated by the state-of-the-art performance obtained by LLMs in Medical Question Answering, with striking results such as passing marks in licensing medical exams. However, while impressive, the required quality bar for medical applications remains far from being achieved. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks for assessing medical knowledge lack reference gold explanations, which means that it is not possible to evaluate the reasoning behind LLMs' predictions. Finally, the situation is particularly grim when benchmarking LLMs for languages other than English, which remains, as far as we know, a totally neglected topic. To address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams for evaluating LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA is the first to include reference gold explanations, written by medical doctors, of the correct and incorrect options in the exams. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches shows that LLM performance, with best results of around 75% accuracy for English, still has large room for improvement, especially for languages other than English, for which accuracy drops by 10 points. Therefore, despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge that may positively impact results on downstream evaluations for Medical Question Answering. Data, code, and fine-tuned models will be made publicly available.
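    The RAG setup evaluated in this benchmark follows the usual retrieve-then-prompt pattern: fetch relevant medical snippets, prepend them to the question, and ask the model to pick an option. A minimal sketch, assuming a toy token-overlap retriever in place of the paper's actual retrieval method (function names, scoring, and prompt wording are illustrative, not from the paper):

```python
def retrieve(question, corpus, k=2):
    """Rank corpus snippets by token overlap with the question (toy retriever)."""
    q_tokens = set(question.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_tokens & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, options, snippets):
    """Prepend retrieved evidence so the model grounds its answer in it."""
    context = "\n".join(f"- {s}" for s in snippets)
    choices = "\n".join(f"{label}. {text}" for label, text in options.items())
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Options:\n{choices}\n"
        f"Answer with the option letter."
    )
```

    In a real pipeline the prompt would be sent to an LLM backend; here only the retrieval and prompt-assembly steps are sketched.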

  • Article type: Journal Article
    ChatGPT explores a strategic blueprint of question answering (QA) to deliver medical diagnoses, treatment recommendations, and other healthcare support. This is achieved through the increasing incorporation of medical domain data via natural language processing (NLP) and multimodal paradigms. By transitioning the distribution of text, images, videos, and other modalities from the general domain to the medical domain, these techniques have accelerated the progress of medical domain question answering (MDQA). They bridge the gap between human natural language and sophisticated medical domain knowledge or expert-provided manual annotations, handling large-scale, diverse, unbalanced, or even unlabeled data analysis scenarios in medical contexts. Central to our focus is the utilization of language models and multimodal paradigms for medical question answering, aiming to guide the research community in selecting appropriate mechanisms for their specific medical research requirements. Specialized tasks such as unimodal-related question answering, reading comprehension, reasoning, diagnosis, relation extraction, probability modeling, and others, as well as multimodal-related tasks like vision question answering, image captioning, cross-modal retrieval, report summarization, and generation, are discussed in detail. Each section delves into the intricate specifics of the respective method under consideration. This paper highlights the structures and advancements of medical domain explorations against general domain methods, emphasizing their applications across different tasks and datasets. It also outlines current challenges and opportunities for future medical domain research, paving the way for continued innovation and application in this rapidly evolving field. This comprehensive review serves not only as an academic resource but also delineates the course for future probes and utilization in the field of medical question answering.

  • Article type: Journal Article
    The ability of Large Language Models (LLMs) to analyze and respond to freely written text is causing increasing excitement in the field of psychiatry; the application of such models presents unique opportunities and challenges for psychiatric applications. This review article seeks to offer a comprehensive overview of LLMs in psychiatry, their model architecture, potential use cases, and clinical considerations. LLM frameworks such as ChatGPT/GPT-4 are trained on huge amounts of text data that are sometimes fine-tuned for specific tasks. This opens up a wide range of possible psychiatric applications, such as accurately predicting individual patient risk factors for specific disorders, engaging in therapeutic intervention, and analyzing therapeutic material, to name a few. However, adoption in the psychiatric setting presents many challenges, including inherent limitations and biases in LLMs, concerns about explainability and privacy, and the potential damage resulting from produced misinformation. This review covers potential opportunities and limitations and highlights potential considerations when these models are applied in a real-world psychiatric context.

  • Article type: Preprint
    Objective: To enhance the accuracy and reliability of diverse medical question-answering (QA) tasks and to investigate efficient approaches for deploying Large Language Model (LLM) technologies, we developed a novel ensemble learning pipeline utilizing state-of-the-art LLMs, focusing on improving performance across diverse medical QA datasets.
    Methods: Our study employs three medical QA datasets: PubMedQA, MedQA-USMLE, and MedMCQA, each presenting unique challenges in biomedical question answering. The proposed LLM-Synergy framework, focusing exclusively on zero-shot use of LLMs, incorporates two primary ensemble methods. The first is a boosting-based weighted majority vote ensemble, in which decision-making is expedited and refined by assigning variable weights to different LLMs through a boosting algorithm. The second is cluster-based dynamic model selection, which dynamically selects the most suitable LLM votes for each query, based on the characteristics of the question context, using a clustering approach.
    Results: The weighted majority vote and dynamic model selection methods outperform individual LLMs across all three medical QA datasets. Specifically, with the weighted majority vote, the accuracies are 35.84%, 96.21%, and 37.26% for MedMCQA, PubMedQA, and MedQA-USMLE, respectively. Dynamic model selection yields slightly higher accuracies of 38.01%, 96.36%, and 38.13%.
    Conclusions: The LLM-Synergy framework with its two ensemble methods represents a significant advance in leveraging LLMs for medical QA tasks and provides an innovative way of efficiently utilizing LLM technologies, customizable for both existing and potential future challenging tasks in biomedical and health informatics research.
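    The weighted-majority-vote idea in this abstract can be sketched as follows; the per-model weights are taken as given here (in LLM-Synergy they are learned by a boosting algorithm), and the model names and weight values are purely illustrative:

```python
from collections import defaultdict

def weighted_majority_vote(votes, weights):
    """Combine per-LLM answer votes using per-model weights.

    votes:   {model_name: chosen_option}
    weights: {model_name: weight}; models missing a weight default to 1.0
    """
    tally = defaultdict(float)
    for model, option in votes.items():
        tally[option] += weights.get(model, 1.0)
    # Return the option with the largest total weight
    return max(tally, key=tally.get)
```

    With uniform weights this reduces to a plain majority vote; a sufficiently high weight lets one strong model override the rest.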

  • Article type: Journal Article
    Medical and clinical question answering (QA) has recently received considerable attention from researchers. Despite remarkable advances in this field, development in the Chinese medical domain lags behind, which can be attributed to the difficulty of Chinese text processing and the lack of large-scale datasets. To bridge this gap, this paper introduces a Chinese medical QA dataset and proposes effective methods for the task.
    We first construct a large-scale Chinese medical QA dataset. We then leverage deep matching neural networks to capture the semantic interaction between words in questions and answers. Considering that Chinese Word Segmentation (CWS) tools may fail to identify clinical terms, we design a module that merges word segments and produces a new representation. It learns common compositions of words or segments using convolutional kernels and selects the strongest signals by windowed pooling.
    We identify the best performer among popular CWS tools on our dataset. In our experiments, deep matching models substantially outperform existing methods. Results also show that our proposed semantic clustered representation module improves model performance by up to 5.5% in Precision at 1 and 4.9% in Mean Average Precision.
    In this paper, we introduce a large-scale Chinese medical QA dataset and cast the task as a semantic matching problem. We also compare different CWS tools and input units. Of the two state-of-the-art deep matching neural networks, MatchPyramid performs better. Results also confirm the effectiveness of the proposed semantic clustered representation module.
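    The segment-merging module in this abstract combines convolutional kernels with windowed pooling. A minimal 1-D sketch of those two operations, assuming scalar features per position rather than the embedding vectors a real model would use:

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution: slide the kernel over the feature sequence,
    learning 'compositions' of adjacent words or segments."""
    k = len(kernel)
    return [
        sum(seq[i + j] * kernel[j] for j in range(k))
        for i in range(len(seq) - k + 1)
    ]

def windowed_max_pool(feats, window):
    """Keep only the strongest signal within each non-overlapping window."""
    return [max(feats[i:i + window]) for i in range(0, len(feats), window)]
```

    A kernel spanning two positions merges adjacent segments into one feature, and the pooling step then discards all but the strongest composition in each window, mirroring the "select the strongest signals" behavior described above.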