Large language model

  • Article type: Journal Article
    BACKGROUND: The complex nature of rheumatic diseases poses considerable challenges for clinicians when developing individualized treatment plans. Large language models (LLMs) such as ChatGPT could enable treatment decision support.
    OBJECTIVE: To compare treatment plans generated by ChatGPT-3.5 and GPT-4 to those of a clinical rheumatology board (RB).
    METHODS: Fictional patient vignettes were created and GPT-3.5, GPT-4, and the RB were queried to provide respective first- and second-line treatment plans with underlying justifications. Four rheumatologists from different centers, blinded to the origin of treatment plans, selected the overall preferred treatment concept and assessed treatment plans' safety, EULAR guideline adherence, medical adequacy, overall quality, justification of the treatment plans and their completeness as well as patient vignette difficulty using a 5-point Likert scale.
    RESULTS: 20 fictional vignettes covering various rheumatic diseases and varying difficulty levels were assembled and a total of 160 ratings were assessed. In 68.8% (110/160) of cases, raters preferred the RB's treatment plans over those generated by GPT-4 (16.3%; 26/160) and GPT-3.5 (15.0%; 24/160). GPT-4's plans were chosen more frequently for first-line treatments compared to GPT-3.5. No significant safety differences were observed between RB and GPT-4's first-line treatment plans. Rheumatologists' plans received significantly higher ratings in guideline adherence, medical appropriateness, completeness and overall quality. Ratings did not correlate with the vignette difficulty. LLM-generated plans were notably longer and more detailed.
    CONCLUSIONS: GPT-4 and GPT-3.5 generated safe, high-quality treatment plans for rheumatic diseases, demonstrating promise in clinical decision support. Future research should investigate detailed standardized prompts and the impact of LLM usage on clinical decisions.
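    The querying step described in the methods above can be reproduced in a few lines. The sketch below is a hypothetical illustration using the OpenAI Python client; the prompt wording, model identifiers, and parameters are assumptions, not the study's actual protocol.

    ```python
    # Hypothetical sketch of querying two chat models with the same fictional vignette;
    # prompt text, model names, and parameters are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    PROMPT = (
        "You are supporting a rheumatology board.\n"
        "Patient vignette:\n{vignette}\n\n"
        "Provide a first-line and a second-line treatment plan, each with a short justification."
    )

    def get_treatment_plans(vignette: str, model: str) -> str:
        """Return the model's free-text first- and second-line treatment plans."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(vignette=vignette)}],
            temperature=0,  # keep outputs as repeatable as possible
        )
        return response.choices[0].message.content

    vignette = "58-year-old woman with seropositive rheumatoid arthritis ..."  # fictional stub
    plans = {m: get_treatment_plans(vignette, m) for m in ("gpt-3.5-turbo", "gpt-4")}
    ```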

  • Article type: Journal Article
    BACKGROUND: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes.
    OBJECTIVE: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students' free-text history and physical notes.
    METHODS: This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students' notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct.
    RESULTS: The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002).
    CONCLUSIONS: ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students' standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.
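    The headline figures above follow from simple arithmetic; the short check below reproduces them from the numbers reported in the abstract (168 students, 85 rubric elements, 1.0% vs 7.2% incorrect scoring rates).

    ```python
    # Reproduce the totals and the "86% lower" relative error reduction from the abstract.
    students, rubric_elements = 168, 85
    total_scores = students * rubric_elements              # 168 * 85 = 14,280 scores
    chatgpt_error_rate, sp_error_rate = 0.010, 0.072       # incorrect scoring rates

    relative_reduction = (sp_error_rate - chatgpt_error_rate) / sp_error_rate
    print(total_scores)                   # 14280
    print(f"{relative_reduction:.0%}")    # 86% -> ChatGPT's error rate is ~86% lower
    ```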

  • Article type: Journal Article
    OBJECTIVE: To evaluate the performance of four large language models (LLMs)-GPT-4, PaLM 2, Qwen, and Baichuan 2-in generating responses to inquiries from Chinese patients about dry eye disease (DED).
    DESIGN: Two-phase study, including a cross-sectional test in the first phase and a real-world clinical assessment in the second phase.
    PARTICIPANTS: Eight board-certified ophthalmologists and 46 patients with DED.
    METHODS: The chatbots' responses to Chinese patients' inquiries about DED were evaluated. In the first phase, six senior ophthalmologists subjectively rated the chatbots' responses using a 5-point Likert scale across five domains: correctness, completeness, readability, helpfulness, and safety. Objective readability analysis was performed using a Chinese readability analysis platform. In the second phase, 46 representative patients with DED asked questions of the two language models (GPT-4 and Baichuan 2) that had performed best in the first phase and then rated the answers for satisfaction and readability. Two senior ophthalmologists then assessed the responses across the five domains.
    MAIN OUTCOME MEASURES: Subjective scores for the five domains and objective readability scores in the first phase. Patient satisfaction, readability scores, and subjective scores for the five domains in the second phase.
    RESULTS: In the first phase, GPT-4 exhibited superior performance across the five domains (correctness: 4.47; completeness: 4.39; readability: 4.47; helpfulness: 4.49; safety: 4.47, p < 0.05). However, the readability analysis revealed that GPT-4's responses were highly complex, with an average score of 12.86 (p < 0.05) compared with scores of 10.87, 11.53, and 11.26 for Qwen, Baichuan 2, and PaLM 2, respectively. In the second phase, as shown by the scores for the five domains, both GPT-4 and Baichuan 2 were adept at answering questions posed by patients with DED. However, the completeness of Baichuan 2's responses was relatively poor (4.04 vs. 4.48 for GPT-4, p < 0.05). Nevertheless, Baichuan 2's recommendations were more comprehensible than those of GPT-4 (patient readability: 3.91 vs. 4.61, p < 0.05; ophthalmologist readability: 2.67 vs. 4.33).
    CONCLUSIONS: The findings underscore the potential of LLMs, particularly GPT-4 and Baichuan 2, in delivering accurate and comprehensive responses to questions from Chinese patients about DED.
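    The domain scores reported above are means of 5-point Likert ratings. As a purely illustrative sketch (the ratings below are invented), per-model, per-domain means can be tallied as follows.

    ```python
    # Toy aggregation of 5-point Likert ratings per model and domain; data are invented.
    from collections import defaultdict
    from statistics import mean

    ratings = [                      # (model, domain, rating) as a reviewer might record them
        ("GPT-4", "correctness", 5), ("GPT-4", "correctness", 4),
        ("GPT-4", "completeness", 5), ("Baichuan 2", "correctness", 4),
        ("Baichuan 2", "completeness", 4), ("Baichuan 2", "completeness", 3),
    ]

    by_model_domain = defaultdict(list)
    for model, domain, score in ratings:
        by_model_domain[(model, domain)].append(score)

    for (model, domain), scores in sorted(by_model_domain.items()):
        print(f"{model:>10} | {domain:<12} | mean {mean(scores):.2f} (n={len(scores)})")
    ```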

  • Article type: Journal Article
    BACKGROUND: In the United States, 1 in 5 adults currently serves as a family caregiver for an individual with a serious illness or disability. Unlike professional caregivers, family caregivers often assume this role without formal preparation or training. Thus, there is an urgent need to enhance the capacity of family caregivers to provide quality care. Leveraging technology as an educational tool or an adjunct to care is a promising approach that has the potential to enhance the learning and caregiving capabilities of family caregivers. Large language models (LLMs) can potentially be used as a foundation technology for supporting caregivers. An LLM can be categorized as a foundation model (FM), which is a large-scale model trained on a broad data set that can be adapted to a range of different domain tasks. Despite their potential, FMs have the critical weakness of "hallucination," where the models generate information that can be misleading or inaccurate. Information reliability is essential when language models are deployed as front-line help tools for caregivers.
    OBJECTIVE: This study aimed to (1) develop a reliable caregiving language model (CaLM) by using FMs and a caregiving knowledge base, (2) develop an accessible CaLM using a small FM that requires fewer computing resources, and (3) evaluate the model's performance compared with a large FM.
    METHODS: We developed a CaLM using the retrieval augmented generation (RAG) framework combined with FM fine-tuning for improving the quality of FM answers by grounding the model on a caregiving knowledge base. The key components of the CaLM are the caregiving knowledge base, a fine-tuned FM, and a retriever module. We used 2 small FMs as candidates for the foundation of the CaLM (LLaMA [large language model Meta AI] 2 and Falcon with 7 billion parameters) and adopted a large FM (GPT-3.5 with an estimated 175 billion parameters) as a benchmark. We developed the caregiving knowledge base by gathering various types of documents from the internet. We focused on caregivers of individuals with Alzheimer disease and related dementias. We evaluated the models' performances using the benchmark metrics commonly used in evaluating language models and their reliability for providing accurate references with their answers.
    RESULTS: The RAG framework improved the performance of all FMs used in this study across all measures. As expected, the large FM performed better than the small FMs across all metrics. Interestingly, the small fine-tuned FMs with RAG performed significantly better than GPT-3.5 across all metrics. The fine-tuned LLaMA 2 small FM performed better than GPT-3.5 (even with RAG) in returning references with the answers.
    CONCLUSIONS: The study shows that a reliable and accessible CaLM can be developed using small FMs with a knowledge base specific to the caregiving domain.
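    To make the RAG setup described in the methods above more concrete, the sketch below shows only the retrieval-and-prompt-assembly step, using a plain TF-IDF retriever as a stand-in; the study's fine-tuned LLaMA 2 / Falcon models, its actual retriever, and the caregiving knowledge base are not reproduced here.

    ```python
    # Minimal retrieval step of a RAG pipeline using TF-IDF as an illustrative retriever;
    # the documents below are placeholders for the caregiving knowledge base.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    knowledge_base = [
        "Establishing a predictable daily routine can reduce confusion in dementia ...",
        "For evening agitation (sundowning), keep the environment calm and well lit ...",
        "Caregivers should schedule regular respite breaks to avoid burnout ...",
    ]

    vectorizer = TfidfVectorizer().fit(knowledge_base)
    doc_matrix = vectorizer.transform(knowledge_base)

    def retrieve(question: str, k: int = 2) -> list[str]:
        """Return the k knowledge-base passages most similar to the question."""
        sims = cosine_similarity(vectorizer.transform([question]), doc_matrix)[0]
        return [knowledge_base[i] for i in sims.argsort()[::-1][:k]]

    question = "How do I handle agitation in the evening?"
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below and cite the passages you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # `prompt` would then be passed to the fine-tuned foundation model.
    ```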

  • Article type: Journal Article
    The aim of this study was to assess nurses' awareness and use of ChatGPT. The study was conducted in October 2023 with an online questionnaire for 124 nurses in the nursing education programme at West China Hospital. The questionnaire included participants' demographic information, awareness of ChatGPT, and actual experience of using it. A total of 57.3% (71/124) of the nurses completed the survey. Of these, 56.3% (40/71) were aware of ChatGPT and 43.7% (31/71) were not aware of ChatGPT. In terms of use, of the 20 who used ChatGPT, 13 used it for studying, 10 for essay writing, five for research, and two for chatting. This study highlights the potential of ChatGPT to improve nurses' professional competence and effectiveness. Further research will focus on how ChatGPT can be used more effectively to support nurses' professional development and growth.

  • Article type: Journal Article
    BACKGROUND: Discharge letters are a critical component in the continuity of care between specialists and primary care providers. However, these letters are time-consuming to write, underprioritized in comparison to direct clinical care, and are often tasked to junior doctors. Prior studies assessing the quality of discharge summaries written for inpatient hospital admissions show inadequacies in many domains. Large language models such as GPT have the ability to summarize large volumes of unstructured free text such as electronic medical records and have the potential to automate such tasks, providing time savings and consistency in quality.
    OBJECTIVE: The aim of this study was to assess the performance of GPT-4 in generating discharge letters written from urology specialist outpatient clinics to primary care providers and to compare their quality against letters written by junior clinicians.
    METHODS: Fictional electronic records were written by physicians simulating 5 common urology outpatient cases with long-term follow-up. Records comprised simulated consultation notes, referral letters and replies, and relevant discharge summaries from inpatient admissions. GPT-4 was tasked to write discharge letters for these cases with a specified target audience of primary care providers who would be continuing the patient's care. Prompts were written for safety, content, and style. Concurrently, junior clinicians were provided with the same case records and instructional prompts. GPT-4 output was assessed for instances of hallucination. A blinded panel of primary care physicians then evaluated the letters using a standardized questionnaire tool.
    RESULTS: GPT-4 outperformed human counterparts in information provision (mean 4.32, SD 0.95 vs 3.70, SD 1.27; P=.03) and had no instances of hallucination. There were no statistically significant differences in the mean clarity (4.16, SD 0.95 vs 3.68, SD 1.24; P=.12), collegiality (4.36, SD 1.00 vs 3.84, SD 1.22; P=.05), conciseness (3.60, SD 1.12 vs 3.64, SD 1.27; P=.71), follow-up recommendations (4.16, SD 1.03 vs 3.72, SD 1.13; P=.08), and overall satisfaction (3.96, SD 1.14 vs 3.62, SD 1.34; P=.36) between the letters generated by GPT-4 and humans, respectively.
    CONCLUSIONS: Discharge letters written by GPT-4 had equivalent quality to those written by junior clinicians, without any hallucinations. This study provides a proof of concept that large language models can be useful and safe tools in clinical documentation.
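    The methods above mention prompts written for safety, content, and style. A hypothetical way to structure such a prompt is sketched below; the headings and wording are illustrative and not the study's actual instructions.

    ```python
    # Illustrative prompt assembly for the discharge-letter task; wording is assumed, not the study's.
    SAFETY = "Use only facts present in the records; do not invent investigations, drugs, or dates."
    CONTENT = "Cover the diagnosis, treatment to date, current medications, and the follow-up plan."
    STYLE = "Write a concise, collegial letter addressed to the patient's primary care provider."

    def build_discharge_messages(case_record: str) -> list[dict]:
        """Pair the safety/content/style instructions with one simulated case record."""
        return [
            {"role": "system", "content": f"{SAFETY}\n{CONTENT}\n{STYLE}"},
            {"role": "user", "content": f"Clinic records:\n{case_record}\n\nDraft the discharge letter."},
        ]
    ```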

  • Article type: Journal Article
    BACKGROUND: Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.
    OBJECTIVE: This study aims to evaluate 3 large language model chatbots-Claude-2, GPT-3.5, and GPT-4-on assigning RADS categories to radiology reports and assess the impact of different prompting strategies.
    METHODS: This cross-sectional study compared 3 chatbots using 30 radiology reports (10 per RADS criteria), using a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging, meticulously prepared by board-certified radiologists. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses at patient-level RADS categorization and overall ratings. The agreement across repetitions was assessed using Fleiss κ.
    RESULTS: Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018.
    CONCLUSIONS: When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.
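    Interrun agreement in the results above is reported as Fleiss κ. The plain-Python implementation below shows how that statistic is computed; the small rating table is invented solely to illustrate the input format (rows = reports, columns = counts of each candidate category across the 6 runs).

    ```python
    # Fleiss kappa from a subjects-by-categories count table; example counts are invented.
    def fleiss_kappa(table: list[list[int]]) -> float:
        N = len(table)                 # number of subjects (reports)
        n = sum(table[0])              # ratings per subject (runs per report)
        k = len(table[0])              # number of categories
        p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
        P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
        P_bar, P_e = sum(P_i) / N, sum(p * p for p in p_j)
        return (P_bar - P_e) / (1 - P_e)

    example = [[6, 0, 0], [4, 2, 0], [0, 5, 1]]   # 3 reports, 6 runs each, 3 categories
    print(round(fleiss_kappa(example), 2))        # ~0.46 on this toy table
    ```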

  • Article type: Journal Article
    OBJECTIVE: The purpose of this study was to assess the utility of information generated by ChatGPT for residency education in China.
    METHODS: We designed a three-step survey to evaluate the performance of ChatGPT in China's residency training education, including residency final examination questions, patient cases, and resident satisfaction scores. First, 204 questions from the residency final exam were input into ChatGPT's interface to obtain the percentage of correct answers. Next, ChatGPT was asked to generate 20 clinical cases, which were subsequently evaluated by three instructors using a pre-designed 5-point Likert scale. The quality of the cases was assessed based on criteria including clarity, relevance, logicality, credibility, and comprehensiveness. Finally, interaction sessions between 31 third-year residents and ChatGPT were conducted. Residents' perceptions of ChatGPT's feedback were assessed using a Likert scale, focusing on aspects such as ease of use, accuracy and completeness of responses, and its effectiveness in enhancing understanding of medical knowledge.
    RESULTS: Our results showed ChatGPT-3.5 correctly answered 45.1% of exam questions. In the virtual patient cases, ChatGPT received mean ratings of 4.57 ± 0.50, 4.68 ± 0.47, 4.77 ± 0.46, 4.60 ± 0.53, and 3.95 ± 0.59 points for clarity, relevance, logicality, credibility, and comprehensiveness from clinical instructors, respectively. Among training residents, ChatGPT scored 4.48 ± 0.70, 4.00 ± 0.82, and 4.61 ± 0.50 points for ease of use, accuracy and completeness, and usefulness, respectively.
    CONCLUSIONS: Our findings demonstrate ChatGPT's immense potential for personalized Chinese medical education.

  • Article type: Journal Article
    BACKGROUND: Patients find technology tools to be more approachable for seeking sensitive health-related information, such as reproductive health information. The inventive conversational ability of artificial intelligence (AI) chatbots, such as ChatGPT (OpenAI Inc), offers a potential means for patients to effectively locate answers to their health-related questions digitally.
    OBJECTIVE: A pilot study was conducted to compare the novel ChatGPT with the existing Google Search technology for their ability to offer accurate, effective, and current information regarding the action to take after missing a dose of an oral contraceptive pill.
    METHODS: A sequence of 11 questions, mimicking a patient inquiring about the action to take after missing a dose of an oral contraceptive pill, was input into ChatGPT as a cascade, given the conversational ability of ChatGPT. The questions were input into 4 different ChatGPT accounts, with the account holders being of various demographics, to evaluate potential differences and biases in the responses given to different account holders. The leading question, "what should I do if I missed a day of my oral contraception birth control?" alone was then input into Google Search, given its nonconversational nature. The results from the ChatGPT questions and the Google Search results for the leading question were evaluated on their readability, accuracy, and effective delivery of information.
    RESULTS: The ChatGPT results were at an overall higher grade reading level, took longer to read, and were less accurate, less current, and less effective in delivering information. In contrast, the Google Search answer box and snippets were at a lower grade reading level, had a shorter reading duration, were more current, referenced the origin of the information (transparent), and provided the information in various formats in addition to text.
    CONCLUSIONS: ChatGPT has room for improvement in accuracy, transparency, recency, and reliability before it can equitably be implemented into health care information delivery and provide the potential benefits it poses. However, AI may be used as a tool for providers to educate their patients in preferred, creative, and efficient ways, such as using AI to generate accessible short educational videos from health care provider-vetted information. Larger studies representing a diverse group of users are needed.
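    The comparison above hinges on grade reading level. The study does not state which readability index was used, so the sketch below applies the Flesch-Kincaid grade-level formula with a crude syllable heuristic as one plausible way to score a response.

    ```python
    # Flesch-Kincaid grade level with a rough syllable counter; used here only as an illustration.
    import re

    def flesch_kincaid_grade(text: str) -> float:
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
        return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

    answer = "If you miss one active pill, take it as soon as you remember and continue as usual."
    print(round(flesch_kincaid_grade(answer), 1))
    ```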

  • Article type: Journal Article
    BACKGROUND: The potential of artificial intelligence (AI) chatbots, particularly ChatGPT with GPT-4 (OpenAI), in assisting with medical diagnosis is an emerging research area. However, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in differential diagnosis lists.
    OBJECTIVE: This study aims to assess the capability of GPT-4 in identifying the final diagnosis from differential-diagnosis lists and to compare its performance with that of physicians for case report series.
    METHODS: We used a database of differential-diagnosis lists from case reports in the American Journal of Case Reports, corresponding to final diagnoses. These lists were generated by 3 AI systems: GPT-4, Google Bard (currently Google Gemini), and Large Language Models by Meta AI 2 (LLaMA 2). The primary outcome was focused on whether GPT-4's evaluations identified the final diagnosis within these lists. None of these AIs received additional medical training or reinforcement. For comparison, 2 independent physicians also evaluated the lists, with any inconsistencies resolved by another physician.
    RESULTS: The 3 AIs generated a total of 1176 differential-diagnosis lists from 392 case descriptions. GPT-4's evaluations concurred with those of the physicians in 966 out of 1176 lists (82.1%). The Cohen κ coefficient was 0.63 (95% CI 0.56-0.69), indicating a fair to good agreement between GPT-4 and the physicians' evaluations.
    CONCLUSIONS: GPT-4 demonstrated a fair to good agreement in identifying the final diagnosis from differential-diagnosis lists, comparable to physicians for case report series. Its ability to compare differential diagnosis lists with final diagnoses suggests its potential to aid clinical decision-making support through diagnostic feedback. While GPT-4 showed a fair to good agreement for evaluation, its application in real-world scenarios and further validation in diverse clinical environments are essential to fully understand its utility in the diagnostic process.
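    Agreement in the results above is summarized with the Cohen κ coefficient. The observed agreement (966/1176) comes from the abstract, but the marginal counts in the toy confusion table below are invented purely to show how the chance-corrected statistic is obtained.

    ```python
    # Cohen's kappa from a 2x2 agreement table; off-diagonal and marginal counts are hypothetical.
    def cohen_kappa(confusion: list[list[int]]) -> float:
        total = sum(sum(row) for row in confusion)
        p_o = sum(confusion[i][i] for i in range(len(confusion))) / total
        p_e = sum(
            (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
            for i in range(len(confusion))
        )
        return (p_o - p_e) / (1 - p_e)

    # rows = GPT-4's judgement ("final diagnosis in list?"), columns = physicians' judgement
    confusion = [[600, 110], [100, 366]]     # 966/1176 on the diagonal = 82.1% raw agreement
    print(round(cohen_kappa(confusion), 2))  # ~0.63 with these hypothetical marginals
    ```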