ChatGPT-4

  • Article Type: Journal Article
    BACKGROUND: Evaluating the accuracy and educational utility of artificial intelligence-generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored.
    OBJECTIVE: This study aimed to assess the educational utility of ChatGPT-4-generated clinical vignettes and their applicability in educational settings.
    METHODS: Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. In the survey, 6 main question items were used to evaluate the quality of the generated clinical vignettes and their educational utility, namely information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine and experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians' experience. Thematic analysis of qualitative feedback was performed to identify areas for improvement and confirm the educational utility of the cases.
    RESULTS: Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented diverse hospital sizes throughout Japan. The majority deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) to be satisfactory, with these responses being based on binary data. The average scores assigned were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty, based on a 5-point Likert scale. Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, clinical information consistency, content relevance, and patient-centered case presentations.
    CONCLUSIONS: ChatGPT-4-generated medical cases written in Japanese possess considerable potential as resources in medical education, with recognized adequacy in quality and accuracy. Nevertheless, there is a notable need for enhancements in the precision and realism of case details. This study emphasizes ChatGPT-4's value as an adjunctive educational tool in the medical field, requiring expert oversight for optimal application.
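
    A minimal sketch of the per-item summary statistics and the Bonferroni-corrected pairwise Mann-Whitney U comparisons described in the METHODS and RESULTS above. The DataFrame, column names, and values are hypothetical placeholders, not data from the study.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical ratings: one row per (respondent, case), 5-point Likert item.
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "case": np.repeat([f"case_{i}" for i in range(1, 4)], 20),
    "educational_usefulness": rng.integers(1, 6, size=60),
})

# Mean with a normal-approximation 95% CI, as reported for each Likert item.
scores = ratings["educational_usefulness"]
low, high = stats.norm.interval(0.95, loc=scores.mean(), scale=stats.sem(scores))
print(f"mean {scores.mean():.2f}, 95% CI {low:.2f}-{high:.2f}")

# Pairwise Mann-Whitney U tests between cases, Bonferroni-corrected.
pairs = list(combinations(ratings["case"].unique(), 2))
alpha = 0.05 / len(pairs)
for a, b in pairs:
    u, p = stats.mannwhitneyu(scores[ratings["case"] == a],
                              scores[ratings["case"] == b])
    print(f"{a} vs {b}: U={u:.0f}, P={p:.3f}, significant={p < alpha}")
```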

  • Article Type: Journal Article
    BACKGROUND: The evolution of artificial intelligence (AI) has significantly impacted various sectors, with health care witnessing some of its most groundbreaking contributions. Contemporary models, such as ChatGPT-4 and Microsoft Bing, have showcased capabilities beyond just generating text, aiding in complex tasks like literature searches and refining web-based queries.
    OBJECTIVE: This study explores a compelling query: can AI author an academic paper independently? Our assessment focuses on four core dimensions: relevance (to ensure that AI's response directly addresses the prompt), accuracy (to ascertain that AI's information is both factually correct and current), clarity (to examine AI's ability to present coherent and logical ideas), and tone and style (to evaluate whether AI can align with the formality expected in academic writings). Additionally, we will consider the ethical implications and practicality of integrating AI into academic writing.
    METHODS: To assess the capabilities of ChatGPT-4 and Microsoft Bing in the context of academic paper assistance in general practice, we used a systematic approach. ChatGPT-4, an advanced AI language model by OpenAI, excels in generating human-like text and adapting responses based on user interactions, though it has a knowledge cutoff in September 2021. Microsoft Bing's AI chatbot facilitates user navigation on the Bing search engine, offering tailored search results.
    RESULTS: In terms of relevance, ChatGPT-4 delved deeply into AI's health care role, citing academic sources and discussing diverse applications and concerns, while Microsoft Bing provided a concise, less detailed overview. In terms of accuracy, ChatGPT-4 correctly cited 72% (23/32) of its peer-reviewed articles but included some nonexistent references. Microsoft Bing's accuracy stood at 46% (6/13), supplemented by relevant non-peer-reviewed articles. In terms of clarity, both models conveyed clear, coherent text. ChatGPT-4 was particularly adept at detailing technical concepts, while Microsoft Bing was more general. In terms of tone, both models maintained an academic tone, but ChatGPT-4 exhibited superior depth and breadth in content delivery.
    CONCLUSIONS: Comparing ChatGPT-4 and Microsoft Bing for academic assistance revealed strengths and limitations. ChatGPT-4 excels in depth and relevance but falters in citation accuracy. Microsoft Bing is concise but lacks robust detail. Though both models have potential, neither can independently handle comprehensive academic tasks. As AI evolves, combining ChatGPT-4's depth with Microsoft Bing's up-to-date referencing could optimize academic support. Researchers should critically assess AI outputs to maintain academic credibility.
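
    The citation-accuracy figures above are simple binomial proportions; the sketch below recomputes them and adds Wilson 95% confidence intervals. The intervals are an illustrative addition for context, not values reported in the paper.

```python
# Reported citation accuracy: ChatGPT-4 23/32, Microsoft Bing 6/13.
# Wilson 95% CIs are an illustrative addition, not figures from the paper.
from statsmodels.stats.proportion import proportion_confint

for model, correct, total in [("ChatGPT-4", 23, 32), ("Microsoft Bing", 6, 13)]:
    rate = correct / total
    low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
    print(f"{model}: {rate:.0%} ({correct}/{total}), 95% CI {low:.0%}-{high:.0%}")
```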

  • Article Type: Journal Article
    BACKGROUND: An illness script is a specific script format geared to represent patient-oriented clinical knowledge organized around enabling conditions, faults (i.e., pathophysiological process), and consequences. Generative artificial intelligence (AI) stands out as an educational aid in continuing medical education. The effortless creation of a typical illness script by generative AI could help the comprehension of key features of diseases and increase diagnostic accuracy. No systematic summary of specific examples of illness scripts has been reported since illness scripts are unique to each physician.
    OBJECTIVE: This study investigated whether generative AI can generate illness scripts.
    METHODS: We utilized ChatGPT-4, a generative AI, to create illness scripts for 184 diseases based on the diseases and conditions integral to the National Model Core Curriculum in Japan for undergraduate medical education (2022 revised edition) and primary care specialist training in Japan. Three physicians applied a three-tier grading scale: "A" denotes that the content of each disease's illness script proves sufficient for training medical students, "B" denotes that it is partially lacking but acceptable, and "C" denotes that it is deficient in multiple respects.
    RESULTS: By leveraging ChatGPT-4, we successfully generated each component of the illness script for 184 diseases without any omission. The illness scripts received "A," "B," and "C" ratings of 56.0% (103/184), 28.3% (52/184), and 15.8% (29/184), respectively.
    CONCLUSIONS: Useful illness scripts were seamlessly and instantaneously created using ChatGPT-4 by employing prompts appropriate for medical students. The technology-driven illness script is a valuable tool for introducing medical students to key features of diseases.
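
    The illness-script format described above (enabling conditions, fault, consequences) maps onto a small data structure; the sketch below is one way to represent a generated script together with its three-tier grade. Field names and contents are invented for illustration, not taken from the study's output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IllnessScript:
    """One generated script in the format described in the abstract above."""
    disease: str
    enabling_conditions: List[str]   # predisposing factors, epidemiology
    fault: str                       # the pathophysiological process
    consequences: List[str]          # symptoms, signs, typical findings
    grade: Optional[str] = None      # "A", "B", or "C" on the three-tier scale


example = IllnessScript(
    disease="Community-acquired pneumonia",
    enabling_conditions=["older age", "smoking", "recent viral infection"],
    fault="alveolar infection causing consolidation and impaired gas exchange",
    consequences=["fever", "productive cough", "dyspnea", "focal crackles"],
    grade="A",
)
```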

  • Article Type: Journal Article
    BACKGROUND: Medical documentation plays a crucial role in clinical practice, facilitating accurate patient management and communication among health care professionals. However, inaccuracies in medical notes can lead to miscommunication and diagnostic errors. Additionally, the demands of documentation contribute to physician burnout. Although intermediaries like medical scribes and speech recognition software have been used to ease this burden, they have limitations in terms of accuracy and addressing provider-specific metrics. The integration of ambient artificial intelligence (AI)-powered solutions offers a promising way to improve documentation while fitting seamlessly into existing workflows.
    OBJECTIVE: This study aims to assess the accuracy and quality of Subjective, Objective, Assessment, and Plan (SOAP) notes generated by ChatGPT-4, an AI model, using established transcripts of History and Physical Examination as the gold standard. We seek to identify potential errors and evaluate the model's performance across different categories.
    METHODS: We conducted simulated patient-provider encounters representing various ambulatory specialties and transcribed the audio files. Key reportable elements were identified, and ChatGPT-4 was used to generate SOAP notes based on these transcripts. Three versions of each note were created and compared to the gold standard via chart review; errors generated from the comparison were categorized as omissions, incorrect information, or additions. We compared the accuracy of data elements across versions, transcript length, and data categories. Additionally, we assessed note quality using the Physician Documentation Quality Instrument (PDQI) scoring system.
    RESULTS: Although ChatGPT-4 consistently generated SOAP-style notes, there were, on average, 23.6 errors per clinical case, with errors of omission (86%) being the most common, followed by addition errors (10.5%) and inclusion of incorrect facts (3.2%). There was significant variance between replicates of the same case, with only 52.9% of data elements reported correctly across all 3 replicates. The accuracy of data elements varied across cases, with the highest accuracy observed in the "Objective" section. Consequently, the measure of note quality, assessed by PDQI, demonstrated intra- and intercase variance. Finally, the accuracy of ChatGPT-4 was inversely correlated to both the transcript length (P=.05) and the number of scorable data elements (P=.05).
    CONCLUSIONS: Our study reveals substantial variability in errors, accuracy, and note quality generated by ChatGPT-4. Errors were not limited to specific sections, and the inconsistency in error types across replicates complicated predictability. Transcript length and data complexity were inversely correlated with note accuracy, raising concerns about the model\'s effectiveness in handling complex medical cases. The quality and reliability of clinical notes produced by ChatGPT-4 do not meet the standards required for clinical use. Although AI holds promise in health care, caution should be exercised before widespread adoption. Further research is needed to address accuracy, variability, and potential errors. ChatGPT-4, while valuable in various applications, should not be considered a safe alternative to human-generated clinical documentation at this time.
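
    A sketch of the error tallying and the accuracy-versus-length association described above. The table is hypothetical, and Spearman correlation is used only as one plausible choice; the abstract does not name the specific correlation method.

```python
# Illustrative sketch; DataFrame columns and values are hypothetical.
import pandas as pd
from scipy import stats

notes = pd.DataFrame({
    "case": ["A", "A", "A", "B", "B", "B"],
    "transcript_words": [850, 850, 850, 1400, 1400, 1400],
    "omissions": [18, 22, 15, 25, 30, 27],
    "incorrect": [1, 0, 2, 1, 1, 0],
    "additions": [2, 3, 1, 4, 2, 3],
    "elements_correct": [0.61, 0.55, 0.66, 0.48, 0.41, 0.45],
})

# Distribution of error types across all generated notes.
totals = notes[["omissions", "incorrect", "additions"]].sum()
print((totals / totals.sum()).round(3))

# Association between transcript length and note accuracy.
rho, p = stats.spearmanr(notes["transcript_words"], notes["elements_correct"])
print(f"Spearman rho={rho:.2f}, P={p:.3f}")
```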

  • Article Type: Journal Article
    BACKGROUND: Cochlear implantation is a critical surgical intervention for patients with severe hearing loss. Postoperative care is essential for successful rehabilitation, yet access to timely medical advice can be challenging, especially in remote or resource-limited settings. Integrating advanced artificial intelligence (AI) tools like Chat Generative Pre-trained Transformer (ChatGPT)-4 in post-surgical care could bridge the patient education and support gap.
    OBJECTIVE: This study aimed to assess the effectiveness of ChatGPT-4 as a supplementary information resource for postoperative cochlear implant patients. The focus was on evaluating the AI chatbot's ability to provide accurate, clear, and relevant information, particularly in scenarios where access to healthcare professionals is limited.
    METHODS: Five common postoperative questions related to cochlear implant care were posed to ChatGPT-4. The AI chatbot\'s responses were analyzed for accuracy, response time, clarity, and relevance. The aim was to determine whether ChatGPT-4 could serve as a reliable source of information for patients in need, especially if the patients could not reach out to the hospital or the specialists at that moment.
    RESULTS: ChatGPT-4 provided responses aligned with current medical guidelines, demonstrating accuracy and relevance. The AI chatbot responded to each query within seconds, indicating its potential as a timely resource. Additionally, the responses were clear and understandable, making complex medical information accessible to non-medical audiences. These findings suggest that ChatGPT-4 could effectively supplement traditional patient education, providing valuable support in postoperative care.
    CONCLUSIONS: The study concluded that ChatGPT-4 has significant potential as a supportive tool for cochlear implant patients post surgery. While it cannot replace professional medical advice, ChatGPT-4 can provide immediate, accessible, and understandable information, which is particularly beneficial in special moments. This underscores the utility of AI in enhancing patient care and supporting cochlear implantation.

  • Article Type: Journal Article
    Purpose This study aims to evaluate the performance of three large language models (LLMs), the Generative Pre-trained Transformer (GPT)-3.5, GPT-4, and Google Bard, on the 2023 Japanese National Dentist Examination (JNDE) and assess their potential clinical applications in Japan. Methods A total of 185 questions from the 2023 JNDE were used. These questions were categorized by question type and category. McNemar's test compared the correct response rates between two LLMs, while Fisher's exact test evaluated the performance of LLMs in each question category. Results The overall correct response rates were 73.5% for GPT-4, 66.5% for Bard, and 51.9% for GPT-3.5. GPT-4 showed a significantly higher correct response rate than Bard and GPT-3.5. In the category of essential questions, Bard achieved a correct response rate of 80.5%, surpassing the passing criterion of 80%. In contrast, both GPT-4 and GPT-3.5 fell short of this benchmark, with GPT-4 attaining 77.6% and GPT-3.5 only 52.5%. The scores of GPT-4 and Bard were significantly higher than that of GPT-3.5 (p<0.01). For general questions, the correct response rates were 71.2% for GPT-4, 58.5% for Bard, and 52.5% for GPT-3.5. GPT-4 outperformed GPT-3.5 and Bard (p<0.01). The correct response rates for professional dental questions were 51.6% for GPT-4, 45.3% for Bard, and 35.9% for GPT-3.5. The differences among the models were not statistically significant. All LLMs demonstrated significantly lower accuracy for dentistry questions compared to other types of questions (p<0.01). Conclusions GPT-4 achieved the highest overall score in the JNDE, followed by Bard and GPT-3.5. However, only Bard surpassed the passing score for essential questions. To further understand the application of LLMs in clinical dentistry worldwide, more research on their performance in dental examinations across different languages is required.
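
    A sketch of the two tests named in the Methods above: McNemar's test for paired correct/incorrect outcomes of two models on the same questions, and Fisher's exact test for a 2x2 comparison within one question category. The contingency counts are invented for illustration, not taken from the study.

```python
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

# McNemar's test: two models answering the same 185 questions (paired data).
# Rows: GPT-4 correct/incorrect; columns: Bard correct/incorrect.
paired = np.array([[110, 26],
                   [13,  36]])
result = mcnemar(paired, exact=True)
print(f"McNemar exact P={result.pvalue:.3f}")

# Fisher's exact test: correct vs incorrect counts of two models within one
# question category (unpaired 2x2 comparison).
table = np.array([[33, 31],    # GPT-4: correct, incorrect
                  [29, 35]])   # Bard:  correct, incorrect
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher exact: odds ratio {odds_ratio:.2f}, P={p_value:.3f}")
```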

  • Article Type: Journal Article
    BACKGROUND: Refractive surgery research aims to optimally precategorize patients by their suitability for various types of surgery. Recent advances have led to the development of artificial intelligence-powered algorithms, including machine learning approaches, to assess risks and enhance workflow. Large language models (LLMs) like ChatGPT-4 (OpenAI LP) have emerged as potential general artificial intelligence tools that can assist across various disciplines, possibly including refractive surgery decision-making. However, their actual capabilities in precategorizing refractive surgery patients based on real-world parameters remain unexplored.
    OBJECTIVE: This exploratory study aimed to validate ChatGPT-4's capabilities in precategorizing refractive surgery patients based on commonly used clinical parameters. The goal was to assess whether ChatGPT-4's performance when categorizing batch inputs is comparable to that of a refractive surgeon. A simple binary set of categories (patient suitable for laser refractive surgery or not) as well as a more detailed set were compared.
    METHODS: Data from 100 consecutive patients from a refractive clinic were anonymized and analyzed. Parameters included age, sex, manifest refraction, visual acuity, and various corneal measurements and indices from Scheimpflug imaging. This study compared ChatGPT-4's performance with a clinician's categorizations using Cohen κ coefficient, a chi-square test, a confusion matrix, accuracy, precision, recall, F1-score, and receiver operating characteristic area under the curve.
    RESULTS: A statistically significant noncoincidental accordance was found between ChatGPT-4 and the clinician's categorizations with a Cohen κ coefficient of 0.399 for 6 categories (95% CI 0.256-0.537) and 0.610 for binary categorization (95% CI 0.372-0.792). The model showed temporal instability and response variability, however. The chi-square test on 6 categories indicated an association between the 2 raters' distributions (χ²₅=94.7, P<.001). Here, the accuracy was 0.68, precision 0.75, recall 0.68, and F1-score 0.70. For 2 categories, the accuracy was 0.88, precision 0.88, recall 0.88, F1-score 0.88, and area under the curve 0.79.
    CONCLUSIONS: This study revealed that ChatGPT-4 exhibits potential as a precategorization tool in refractive surgery, showing promising agreement with clinician categorizations. However, its main limitations include, among others, dependency on solely one human rater, small sample size, the instability and variability of ChatGPT's (OpenAI LP) output between iterations, and the nontransparency of the underlying models. The results encourage further exploration into the application of LLMs like ChatGPT-4 in health care, particularly in decision-making processes that require understanding vast clinical data. Future research should focus on defining the model's accuracy with prompt and vignette standardization, detecting confounding factors, and comparing to other versions of ChatGPT-4 and other LLMs to pave the way for larger-scale validation and real-world implementation.
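
    The agreement and classification metrics listed in the Methods above map directly onto scikit-learn. The sketch below computes them for hypothetical binary labels (1 = suitable for laser refractive surgery), not the study's data; with hard labels only, roc_auc_score reduces to balanced accuracy and is shown purely for completeness.

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_auc_score)

# Hypothetical categorizations: clinician (reference) vs ChatGPT-4.
clinician = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0]
chatgpt   = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

print("kappa    ", cohen_kappa_score(clinician, chatgpt))
print("confusion\n", confusion_matrix(clinician, chatgpt))
print("accuracy ", accuracy_score(clinician, chatgpt))
print("precision", precision_score(clinician, chatgpt))
print("recall   ", recall_score(clinician, chatgpt))
print("F1       ", f1_score(clinician, chatgpt))
print("ROC AUC  ", roc_auc_score(clinician, chatgpt))
```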

  • Article Type: Journal Article
    BACKGROUND: The reliability of GPT-4, a state-of-the-art expansive language model specializing in clinical reasoning and medical knowledge, remains largely unverified across non-English languages.
    OBJECTIVE: This study aims to compare fundamental clinical competencies between Japanese residents and GPT-4 by using the General Medicine In-Training Examination (GM-ITE).
    METHODS: We used the GPT-4 model provided by OpenAI and the GM-ITE examination questions for the years 2020, 2021, and 2022 to conduct a comparative analysis. This analysis focused on evaluating the performance of individuals who were concluding their second year of residency in comparison to that of GPT-4. Given the current abilities of GPT-4, our study included only single-choice exam questions, excluding those involving audio, video, or image data. The assessment included 4 categories: general theory (professionalism and medical interviewing), symptomatology and clinical reasoning, physical examinations and clinical procedures, and specific diseases. Additionally, we categorized the questions into 7 specialty fields and 3 levels of difficulty, which were determined based on residents' correct response rates.
    RESULTS: Upon examination of 137 GM-ITE questions in Japanese, GPT-4 scores were significantly higher than the mean scores of residents (residents: 55.8%, GPT-4: 70.1%; P<.001). In terms of specific disciplines, GPT-4 scored 23.5 points higher in the "specific diseases," 30.9 points higher in "obstetrics and gynecology," and 26.1 points higher in "internal medicine." In contrast, GPT-4 scores in "medical interviewing and professionalism," "general practice," and "psychiatry" were lower than those of the residents, although this discrepancy was not statistically significant. Upon analyzing scores based on question difficulty, GPT-4 scores were 17.2 points lower for easy problems (P=.007) but were 25.4 and 24.4 points higher for normal and difficult problems, respectively (P<.001). In year-on-year comparisons, GPT-4 scores were 21.7 and 21.5 points higher in the 2020 (P=.01) and 2022 (P=.003) examinations, respectively, but only 3.5 points higher in the 2021 examinations (no significant difference).
    CONCLUSIONS: In the Japanese language, GPT-4 also outperformed the average medical residents in the GM-ITE test, originally designed for them. Specifically, GPT-4 demonstrated a tendency to score higher on difficult questions with low resident correct response rates and those demanding a more comprehensive understanding of diseases. However, GPT-4 scored comparatively lower on questions that residents could readily answer, such as those testing attitudes toward patients and professionalism, as well as those necessitating an understanding of context and communication. These findings highlight the strengths and limitations of artificial intelligence applications in medical education and practice.
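
    One simple way to check the headline comparison above (GPT-4 at 70.1% on 137 questions versus a resident mean of 55.8%) is a one-sample binomial test. The abstract does not state which test the authors used, so this is an illustrative check, not their method.

```python
# Illustrative check only; not the statistical test used in the study.
from scipy.stats import binomtest

n_questions = 137
gpt4_correct = round(0.701 * n_questions)     # 96 of 137 (70.1%)
resident_rate = 0.558                          # residents' mean correct rate

result = binomtest(gpt4_correct, n_questions, resident_rate)
print(f"GPT-4 {gpt4_correct}/{n_questions} correct, "
      f"P={result.pvalue:.4f} vs residents' mean of {resident_rate:.1%}")
```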

  • Article Type: Journal Article
    BACKGROUND: With the increasing integration of artificial intelligence (AI) in health care, AI chatbots like ChatGPT-4 are being used to deliver health information.
    OBJECTIVE: This study aimed to assess the capability of ChatGPT-4 in answering common questions related to abdominoplasty, evaluating its potential as an adjunctive tool in patient education and preoperative consultation.
    METHODS: A variety of common questions about abdominoplasty were submitted to ChatGPT-4. These questions were sourced from a question list provided by the American Society of Plastic Surgeons to ensure their relevance and comprehensiveness. An experienced plastic surgeon meticulously evaluated the responses generated by ChatGPT-4 in terms of informational depth, response articulation, and competency to determine the proficiency of the AI in providing patient-centered information.
    RESULTS: The study showed that ChatGPT-4 can give clear answers, making it useful for answering common queries. However, it struggled with personalized advice and sometimes provided incorrect or outdated references. Overall, ChatGPT-4 can effectively share abdominoplasty information, which may help patients better understand the procedure. Despite these positive findings, the AI needs more refinement, especially in providing personalized and accurate information, to fully meet patient education needs in plastic surgery.
    CONCLUSIONS: Although ChatGPT-4 shows promise as a resource for patient education, continuous improvements and rigorous checks are essential for its beneficial integration into healthcare settings. The study emphasizes the need for further research, particularly focused on improving the personalization and accuracy of AI responses.
    METHODS: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.

  • Article Type: Journal Article
    Introduction This case study aimed to enhance the traceability and retrieval accuracy of ChatGPT-4 in medical text by employing a step-by-step systematic approach. The focus was on retrieving clinical answers from three international guidelines on diabetic ketoacidosis (DKA). Methods A systematic methodology was developed to guide the retrieval process. One question was asked per guideline to ensure accuracy and maintain referencing. ChatGPT-4 was utilized to retrieve answers, and the 'Link Reader' plug-in was integrated to facilitate direct access to webpages containing the guidelines. Subsequently, ChatGPT-4 was employed to compile answers while providing citations to the sources. This process was iterated 30 times per question to ensure consistency. In this report, we present our observations regarding the retrieval accuracy, consistency of responses, and the challenges encountered during the process. Results Integrating ChatGPT-4 with the 'Link Reader' plug-in demonstrated notable traceability and retrieval accuracy benefits. The AI model successfully provided relevant and accurate clinical answers based on the analyzed guidelines. Despite occasional challenges with webpage access and minor memory drift, the overall performance of the integrated system was promising. The compilation of the answers was also impressive and held significant promise for further trials. Conclusion The findings of this case study contribute to the utilization of AI text-generation models as valuable tools for medical professionals and researchers. The systematic approach employed in this case study and the integration of the 'Link Reader' plug-in offer a framework for automating medical text synthesis, asking one question at a time before compilation from different sources, which has led to improving AI models' traceability and retrieval accuracy. Further advancements and refinement of AI models and integration with other software utilities hold promise for enhancing the utility and applicability of AI-generated recommendations in medicine and scientific academia. These advancements have the potential to drive significant improvements in everyday medical practice.
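
    The 'Link Reader' plug-in runs inside the ChatGPT web interface and cannot be reproduced programmatically here; the sketch below only illustrates the repeat-30-times consistency check described above, using the OpenAI Python SDK with a hypothetical DKA question. The model name, prompt, and placeholder URL are assumptions, not details from the case study.

```python
# Consistency check sketch: repeat one guideline question 30 times and count
# how many distinct answers come back. Prompt and model name are assumptions.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
question = ("According to the guideline at <URL>, what initial fluid is "
            "recommended for adult diabetic ketoacidosis? Cite the section.")

answers = []
for _ in range(30):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    answers.append(response.choices[0].message.content.strip())

counts = Counter(answers)
print(f"{len(counts)} distinct answers across 30 runs")
print(counts.most_common(1)[0])
```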