BACKGROUND: Large language models (LLMs) like OpenAI's ChatGPT are powerful generative systems that rapidly synthesize natural-language responses. Research on LLMs has revealed their potential and pitfalls, especially in clinical settings. However, the evolving landscape of LLM research in medicine has left several gaps regarding their evaluation, application, and evidence base.
OBJECTIVE: This scoping review aims to (1) summarize current research evidence on the accuracy and efficacy of LLMs in medical applications, (2) discuss the ethical, legal, logistical, and socioeconomic implications of LLM use in clinical settings, (3) explore barriers and facilitators to LLM implementation in healthcare, (4) propose a standardized evaluation framework for assessing LLMs' clinical utility, and (5) identify evidence gaps and propose future research directions for LLMs in clinical applications.
METHODS: We screened 4,036 records from MEDLINE, EMBASE, CINAHL, medRxiv, bioRxiv, and arXiv for English-language papers published from January 2023 (inception of the search) to June 26, 2023, and analyzed findings from 55 worldwide studies. Quality of evidence was reported based on the Oxford Centre for Evidence-Based Medicine recommendations.
RESULTS: Our results demonstrate that LLMs show promise in compiling patient notes, assisting patients in navigating the healthcare system, and, to some extent, supporting clinical decision-making when combined with human oversight. However, their utilization is limited by biases in training data that may harm patients, the generation of inaccurate but convincing information, and ethical, legal, socioeconomic, and privacy concerns. We also identified a lack of standardized methods for evaluating LLMs' effectiveness and feasibility.
CONCLUSIONS: This review thus highlights potential future directions and questions to address these limitations and to further explore LLMs' potential in enhancing healthcare delivery.
KEY POINTS: Question What is the current state of large language models' (LLMs) application in clinical settings, and what are the primary challenges and opportunities associated with their integration? Findings This scoping review, analyzing 55 studies, indicates that while LLMs, including OpenAI's ChatGPT, show potential in compiling patient notes, aiding healthcare navigation, and supporting clinical decision-making, their use is constrained by data biases, the generation of plausible but incorrect information, and various ethical and privacy concerns. Significant variability in the rigor of studies, especially in evaluating LLM responses, calls for standardized evaluation methods, including established metrics such as ROUGE, METEOR, G-Eval, and MultiMedQA. Meaning The findings suggest a need for enhanced methodologies in LLM research, stressing the importance of integrating real patient data and considering social determinants of health, to improve the applicability and safety of LLMs in clinical environments.
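To illustrate what a standardized, reproducible metric looks like in practice, below is a minimal sketch of ROUGE-L, one of the metrics named above. ROUGE-L scores a candidate text against a reference via their longest common subsequence (LCS) of tokens. This whitespace-tokenized implementation is illustrative only, not the review's protocol; real evaluations would typically use an established package (e.g. the rouge-score library) with proper tokenization and stemming.

```python
# Illustrative ROUGE-L (LCS-based F1) between a reference and a
# candidate string, using simple whitespace tokenization. This is a
# sketch of the metric's definition, not a production evaluator.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, candidate):
    """ROUGE-L F1 score between a reference string and a candidate string."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # fraction of candidate covered by the LCS
    recall = lcs / len(ref_tokens)      # fraction of reference covered by the LCS
    return 2 * precision * recall / (precision + recall)

# Example: a model-generated note compared against a clinician reference.
score = rouge_l("the patient has fever", "the patient has a fever")
```

Metrics like this give a repeatable, automatable score but capture only surface overlap; that is why the review also points to model-based (G-Eval) and benchmark-based (MultiMedQA) evaluation alongside n-gram metrics.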