Large language models

  • Article type: Journal Article
    OBJECTIVE: The integration of preventive guidelines with Electronic Health Record (EHR) systems, coupled with the generation of personalized preventive care recommendations, holds significant potential for improving healthcare outcomes. Our study investigates the feasibility of using Large Language Models (LLMs) to automate the assessment of criteria and risk factors from the guidelines for future analysis against medical records in the EHR.
    METHODS: We annotated the criteria, risk factors, and preventive medical services described in the adult guidelines published by the United States Preventive Services Task Force and evaluated 3 state-of-the-art LLMs on automatically extracting information in these categories from the guidelines.
    RESULTS: We included 24 guidelines in this study. The LLMs can automate the extraction of all criteria, risk factors, and medical services from 9 of the guidelines. All 3 LLMs perform well on extracting information regarding demographic criteria or risk factors. Some LLMs perform better than others on extracting social determinants of health, family history, and preventive counseling services.
    DISCUSSION: While LLMs demonstrate the capability to handle lengthy preventive care guidelines, several challenges persist, including constraints related to the maximum length of input tokens and the tendency to generate content rather than adhering strictly to the original input. Moreover, the use of LLMs in real-world clinical settings necessitates careful ethical consideration. It is imperative that healthcare professionals meticulously validate the extracted information to mitigate biases, ensure completeness, and maintain accuracy.
    CONCLUSIONS: We developed a data structure to store the annotated preventive guidelines and make it publicly available. Employing state-of-the-art LLMs to extract preventive care criteria, risk factors, and preventive care services paves the way for the future integration of these guidelines into the EHR.
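    The abstract mentions a data structure for storing the annotated guidelines but does not reproduce it; the following is a minimal, hypothetical sketch (in Python) of how such annotations could be organized, with criteria, risk factors, and preventive services stored per guideline. Field names and example values are assumptions, not the authors' published schema.

```python
# Hypothetical sketch of a structure for storing annotated preventive guidelines.
# Field names and example values are illustrative, not the authors' published schema.
from dataclasses import dataclass, field, asdict
from typing import List
import json

@dataclass
class AnnotatedGuideline:
    guideline_id: str                                       # e.g., a USPSTF topic identifier
    title: str
    criteria: List[str] = field(default_factory=list)       # demographic/eligibility criteria
    risk_factors: List[str] = field(default_factory=list)   # annotated risk factors
    services: List[str] = field(default_factory=list)       # recommended preventive services

example = AnnotatedGuideline(
    guideline_id="uspstf-hypertension-screening",
    title="Screening for Hypertension in Adults",
    criteria=["adults aged 18 years or older"],
    risk_factors=["obesity", "family history of hypertension"],
    services=["office blood pressure measurement"],
)

print(json.dumps(asdict(example), indent=2))
```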

  • Article type: Journal Article
    In medical and biomedical education, traditional teaching methods often struggle to engage students and promote critical thinking. The use of AI language models has the potential to transform teaching and learning practices by offering an innovative, active learning approach that promotes intellectual curiosity and deeper understanding. To effectively integrate AI language models into biomedical education, it is essential for educators to understand the benefits and limitations of these tools and how they can be employed to achieve high-level learning outcomes. This article explores the use of AI language models in biomedical education, focusing on their application in both classroom teaching and learning assignments. Using the SOLO taxonomy as a framework, I discuss strategies for designing questions that challenge students to exercise critical thinking and problem-solving skills, even when assisted by AI models. Additionally, I propose a scoring rubric for evaluating student performance when collaborating with AI language models, ensuring a comprehensive assessment of their learning outcomes. AI language models offer a promising opportunity for enhancing student engagement and promoting active learning in the biomedical field. Understanding the potential uses of these technologies allows educators to create learning experiences that fit their students' needs, encouraging intellectual curiosity and a deeper understanding of complex subjects. The application of these tools will be fundamental to providing more effective and engaging learning experiences for students in the future.
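    The proposed rubric is not reproduced in the abstract; as a purely illustrative aid, the sketch below shows one way a SOLO-based rubric could be encoded, using the five standard SOLO levels (prestructural through extended abstract). The score bands and descriptors are assumptions, not the author's rubric.

```python
# Hypothetical illustration of a SOLO-taxonomy-based scoring rubric;
# score bands and descriptors are assumed, not taken from the article.
SOLO_RUBRIC = {
    "prestructural":     {"score": 0, "descriptor": "Misses the point; AI output copied without understanding."},
    "unistructural":     {"score": 1, "descriptor": "One relevant aspect identified from the AI-assisted work."},
    "multistructural":   {"score": 2, "descriptor": "Several relevant aspects listed but not connected."},
    "relational":        {"score": 3, "descriptor": "Aspects integrated; student critiques and corrects AI output."},
    "extended_abstract": {"score": 4, "descriptor": "Generalizes beyond the task; AI used as a springboard for new questions."},
}

def grade(level: str) -> int:
    """Return the numeric score for an observed SOLO level."""
    return SOLO_RUBRIC[level]["score"]

print(grade("relational"))  # -> 3
```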

  • Article type: Journal Article
    BACKGROUND: Clinical guidelines, developed in concordance with the literature, are often used to guide surgeons' clinical decision making. Recent advancements of large language models and artificial intelligence (AI) in the medical field come with exciting potential. OpenAI's generative AI model, known as ChatGPT, can quickly synthesize information and generate responses grounded in medical literature, which may prove to be a useful tool in clinical decision-making for spine care. The current literature has yet to investigate the ability of ChatGPT to assist clinical decision making with regard to degenerative spondylolisthesis.
    OBJECTIVE: The study aimed to compare ChatGPT's concordance with the recommendations set forth by The North American Spine Society (NASS) Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis and assess ChatGPT's accuracy within the context of the most recent literature.
    METHODS: ChatGPT-3.5 and 4.0 were prompted with questions from the NASS Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis, and their recommendations were graded as "concordant" or "nonconcordant" relative to those put forth by NASS. A response was considered "concordant" when ChatGPT generated a recommendation that accurately reproduced all major points made in the NASS recommendation. Any responses graded "nonconcordant" were further stratified into two subcategories, "insufficient" or "over-conclusive," to provide further insight into the grading rationale. Responses between GPT-3.5 and 4.0 were compared using chi-squared tests.
    RESULTS: ChatGPT-3.5 answered 13 of NASS's 28 total clinical questions in concordance with NASS's guidelines (46.4%). The categorical breakdown is as follows: Definitions and Natural History (1/1, 100%), Diagnosis and Imaging (1/4, 25%), Outcome Measures for Medical Intervention and Surgical Treatment (0/1, 0%), Medical and Interventional Treatment (4/6, 66.7%), Surgical Treatment (7/14, 50%), and Value of Spine Care (0/2, 0%). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-3.5 generated a concordant response 66.7% of the time (6/9). However, ChatGPT-3.5's concordance dropped to 36.8% when asked clinical questions for which NASS did not provide a clear recommendation (7/19). A further breakdown of ChatGPT-3.5's nonconcordance with the guidelines revealed that the vast majority of its inaccurate recommendations were "over-conclusive" (12/15, 80%) rather than "insufficient" (3/15, 20%). ChatGPT-4.0 answered 19 (67.9%) of the 28 total questions in concordance with NASS guidelines (P = 0.177). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-4.0 generated a concordant response 66.7% of the time (6/9). ChatGPT-4.0's concordance held at 68.4% when asked clinical questions for which NASS did not provide a clear recommendation (13/19, P = 0.104).
    CONCLUSIONS: This study sheds light on the duality of LLM applications within clinical settings: one of accuracy and utility in some contexts versus inaccuracy and risk in others. ChatGPT was concordant for most clinical questions NASS offered recommendations for. However, for questions NASS did not offer best practices, ChatGPT generated answers that were either too general or inconsistent with the literature, and even fabricated data/citations. Thus, clinicians should exercise extreme caution when attempting to consult ChatGPT for clinical recommendations, taking care to ensure its reliability within the context of recent literature.
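    As a sanity check on the reported statistics, the counts in the abstract (13/28 concordant responses for ChatGPT-3.5 vs. 19/28 for ChatGPT-4.0) reproduce the reported P = 0.177 under a 2 × 2 chi-squared test with Yates' continuity correction; the sketch below assumes that correction was applied, which the abstract does not state explicitly.

```python
# Recomputing the GPT-3.5 vs GPT-4.0 concordance comparison from the reported counts.
# Assumes a 2x2 chi-squared test with Yates' continuity correction (not stated in the abstract).
from scipy.stats import chi2_contingency

table = [
    [13, 28 - 13],  # GPT-3.5: concordant, nonconcordant
    [19, 28 - 19],  # GPT-4.0: concordant, nonconcordant
]
chi2, p, dof, expected = chi2_contingency(table)  # correction=True by default for 2x2 tables
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")          # p ≈ 0.177, matching the reported value
```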

  • Article type: Journal Article
    BACKGROUND: Generative Artificial Intelligence (AI) tools have experienced rapid development over the last decade and are gaining increasing popularity as assistive models in academic writing. However, the ability of AI to generate reliable and accurate research articles is a topic of debate. Major scientific journals have issued policies regarding the contribution of AI tools in scientific writing.
    METHODS: We conducted a review of the author and peer reviewer guidelines of the top 25 Cardiology and Cardiovascular Medicine journals as per the 2023 SCImago rankings. Data were obtained through reviewing journal websites and directly emailing the editorial offices. Descriptive data regarding journal characteristics were coded in SPSS. Subgroup analyses of the journal guidelines were conducted based on the publishing company policies.
    RESULTS: Our analysis revealed that all scientific journals in our study permitted the documented use of AI in scientific writing, with certain limitations, as per ICMJE recommendations. We found that AI tools cannot be listed as authors or used for image generation, and that all authors are required to assume full responsibility for their submitted and published work. The use of generative AI tools in the peer review process is strictly prohibited.
    CONCLUSIONS: Guidelines regarding the use of generative AI in scientific writing are standardized, detailed, and unanimously followed by all journals in our study according to the recommendations set forth by international forums. It is imperative to ensure that these policies are carefully followed and updated to maintain scientific integrity.

  • Article type: Journal Article
    Purpose: To evaluate the accuracy of GPT-3.5, GPT-4, and a fine-tuned GPT-3.5 model in applying Fleischner Society recommendations to lung nodules.
    Methods: We generated 10 lung nodule descriptions for each of the 12 nodule categories from the Fleischner Society guidelines, incorporating each into a fictitious report (n = 120). GPT-3.5 and GPT-4 were prompted to make follow-up recommendations based on the reports. We then incorporated the full guidelines into the prompts and re-submitted them. Finally, we re-submitted the prompts to a fine-tuned GPT-3.5 model. Results were analyzed using binary accuracy analysis in R.
    Results: GPT-3.5 accuracy in applying the Fleischner Society guidelines was 0.058 (95% CI: 0.02, 0.12). GPT-4 accuracy improved to 0.15 (95% CI: 0.09, 0.23; P = .02 for the accuracy comparison). In recommending PET-CT and/or biopsy, both GPT-3.5 and GPT-4 had an F-score of 0.00. After explicitly including the Fleischner Society guidelines in the prompt, GPT-3.5 and GPT-4 significantly improved their accuracy to 0.42 (95% CI: 0.33, 0.51; P < .001) and 0.66 (95% CI: 0.57, 0.74; P < .001), respectively. GPT-4 remained significantly better than GPT-3.5 (P < .001). The fine-tuned GPT-3.5 model's accuracy was 0.46 (95% CI: 0.37, 0.55), not different from the GPT-3.5 model with guidelines included (P = .53).
    Conclusion: GPT-3.5 and GPT-4 performed poorly in applying widely known guidelines and never correctly recommended biopsy. Flawed knowledge and reasoning both contributed to their poor performance. While GPT-4 was more accurate than GPT-3.5, its inaccuracy rate was unacceptable for clinical practice. These results underscore the limitations of large language models for knowledge- and reasoning-based tasks.
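    The reported GPT-3.5 accuracy of 0.058 (95% CI: 0.02, 0.12) corresponds to roughly 7 correct recommendations out of 120 reports; the sketch below shows how such an interval can be reproduced with an exact (Clopper-Pearson) binomial confidence interval. The interval method is an assumption, since the abstract does not state which method was used in R.

```python
# Reproducing a binomial 95% CI for the reported GPT-3.5 accuracy (~7/120 correct).
# Assumes an exact Clopper-Pearson interval; the abstract does not specify the method used in R.
from scipy.stats import binomtest

result = binomtest(k=7, n=120)  # ~7 correct follow-up recommendations out of 120 reports
ci = result.proportion_ci(confidence_level=0.95, method="exact")
print(f"accuracy = {7/120:.3f}, 95% CI = ({ci.low:.2f}, {ci.high:.2f})")  # ≈ 0.058 (0.02, 0.12)
```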

  • Article type: Observational Study
    Background: To assist healthcare providers in interpreting guidelines, clinical questions (CQs) are often, but not always, included, which can make interpretation difficult for non-expert clinicians. We evaluated the ability of ChatGPT to accurately answer CQs on the Japanese Society of Hypertension Guidelines for the Management of Hypertension (JSH 2019).
    Methods and Results: We conducted an observational study using data from JSH 2019. The accuracy rates for CQs and for the guidelines' limited evidence-based questions (Qs) were evaluated. ChatGPT demonstrated a higher accuracy rate for CQs than for Qs (80% vs. 36%, P value: 0.005).
    Conclusions: ChatGPT has the potential to be a valuable tool for clinicians in the management of hypertension.