generative AI

  • Article type: Journal Article
    BACKGROUND: The integration of artificial intelligence (AI), particularly deep learning models, has transformed the landscape of medical technology, especially in the field of diagnosis using imaging and physiological data. In otolaryngology, AI has shown promise in image classification for middle ear diseases. However, existing models often lack patient-specific data and clinical context, limiting their universal applicability. The emergence of GPT-4 Vision (GPT-4V) has enabled a multimodal diagnostic approach, integrating language processing with image analysis.
    OBJECTIVE: In this study, we investigated the effectiveness of GPT-4V in diagnosing middle ear diseases by integrating patient-specific data with otoscopic images of the tympanic membrane.
    METHODS: The design of this study was divided into two phases: (1) establishing a model with appropriate prompts and (2) validating the ability of the optimal prompt model to classify images. In total, 305 otoscopic images of 4 middle ear diseases (acute otitis media, middle ear cholesteatoma, chronic otitis media, and otitis media with effusion) were obtained from patients who visited Shinshu University or Jichi Medical University between April 2010 and December 2023. The optimized GPT-4V settings were established using prompts and patients' data, and the model created with the optimal prompt was used to verify the diagnostic accuracy of GPT-4V on 190 images. To compare the diagnostic accuracy of GPT-4V with that of physicians, 30 clinicians completed a web-based questionnaire consisting of 190 images.
    RESULTS: The multimodal AI approach achieved an accuracy of 82.1%, which is superior to that of certified pediatricians at 70.6%, but trailing behind that of otolaryngologists at more than 95%. The model's disease-specific accuracy rates were 89.2% for acute otitis media, 76.5% for chronic otitis media, 79.3% for middle ear cholesteatoma, and 85.7% for otitis media with effusion, which highlights the need for disease-specific optimization. Comparisons with physicians revealed promising results, suggesting the potential of GPT-4V to augment clinical decision-making.
    CONCLUSIONS: Despite its advantages, challenges such as data privacy and ethical considerations must be addressed. Overall, this study underscores the potential of multimodal AI for enhancing diagnostic accuracy and improving patient care in otolaryngology. Further research is warranted to optimize and validate this approach in diverse clinical settings.
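    The methods above pair patient-specific text with an otoscopic image in a single multimodal request. A minimal sketch of such a call using the OpenAI Python SDK is shown below; the model name, prompt wording, and patient-data fields are illustrative assumptions rather than the study's actual configuration.

        # Sketch of a multimodal diagnostic prompt (illustrative only; the prompt
        # text, patient fields, and model name are assumptions, not the study's setup).
        import base64
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def classify_otoscopic_image(image_path: str, patient_data: dict) -> str:
            # Encode the otoscopic image so it can be sent inline with the prompt.
            with open(image_path, "rb") as f:
                image_b64 = base64.b64encode(f.read()).decode("utf-8")

            prompt = (
                "You are an otolaryngology assistant. Using the otoscopic image and the "
                "patient data below, classify the middle ear disease as one of: acute otitis "
                "media, chronic otitis media, middle ear cholesteatoma, or otitis media with "
                "effusion.\n"
                f"Patient data: {patient_data}"
            )

            response = client.chat.completions.create(
                model="gpt-4o",  # any vision-capable model; the study used GPT-4V
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url",
                         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                    ],
                }],
            )
            return response.choices[0].message.content

        # Hypothetical usage:
        # print(classify_otoscopic_image("ear_001.jpg", {"age": 6, "symptoms": "ear pain, fever"}))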

  • Article type: Journal Article
    This cross-sectional study evaluated the clinical accuracy, relevance, clarity, and emotional sensitivity of responses provided by large language models (LLMs) to inquiries from patients undergoing surgery, highlighting their potential as adjunct tools in patient communication and education. Our findings demonstrated high performance of LLMs across accuracy, relevance, clarity, and emotional sensitivity, with Anthropic's Claude 2 outperforming OpenAI's ChatGPT and Google's Bard, suggesting LLMs' potential to serve as complementary tools for enhanced information delivery and patient-surgeon interaction.

  • Article type: Journal Article
    Video recordings accurately capture facial expression movements; however, they are difficult for face perception researchers to standardise and manipulate. For this reason, dynamic morphs of photographs are often used, despite their lack of naturalistic facial motion. This study aimed to investigate how humans perceive emotions from faces using real videos and two different approaches to artificially generating dynamic expressions - dynamic morphs, and AI-synthesised deepfakes. Our participants perceived dynamic morphed expressions as less intense when compared with videos (all emotions) and deepfakes (fearful, happy, sad). Videos and deepfakes were perceived similarly. Additionally, they perceived morphed happiness and sadness, but not morphed anger or fear, as less genuine than other formats. Our findings support previous research indicating that social responses to morphed emotions are not representative of those to video recordings. The findings also suggest that deepfakes may offer a more suitable standardized stimulus type compared to morphs. Additionally, qualitative data were collected from participants and analysed using ChatGPT, a large language model. ChatGPT successfully identified themes in the data consistent with those identified by an independent human researcher. According to this analysis, our participants perceived dynamic morphs as less natural compared with videos and deepfakes. That participants perceived deepfakes and videos similarly suggests that deepfakes effectively replicate natural facial movements, making them a promising alternative for face perception research. The study contributes to the growing body of research exploring the usefulness of generative artificial intelligence for advancing the study of human perception.

  • Article type: Journal Article
    BACKGROUND: Medical documentation plays a crucial role in clinical practice, facilitating accurate patient management and communication among health care professionals. However, inaccuracies in medical notes can lead to miscommunication and diagnostic errors. Additionally, the demands of documentation contribute to physician burnout. Although intermediaries like medical scribes and speech recognition software have been used to ease this burden, they have limitations in terms of accuracy and addressing provider-specific metrics. The integration of ambient artificial intelligence (AI)-powered solutions offers a promising way to improve documentation while fitting seamlessly into existing workflows.
    OBJECTIVE: This study aims to assess the accuracy and quality of Subjective, Objective, Assessment, and Plan (SOAP) notes generated by ChatGPT-4, an AI model, using established transcripts of History and Physical Examination as the gold standard. We seek to identify potential errors and evaluate the model's performance across different categories.
    METHODS: We conducted simulated patient-provider encounters representing various ambulatory specialties and transcribed the audio files. Key reportable elements were identified, and ChatGPT-4 was used to generate SOAP notes based on these transcripts. Three versions of each note were created and compared to the gold standard via chart review; errors generated from the comparison were categorized as omissions, incorrect information, or additions. We compared the accuracy of data elements across versions, transcript length, and data categories. Additionally, we assessed note quality using the Physician Documentation Quality Instrument (PDQI) scoring system.
    RESULTS: Although ChatGPT-4 consistently generated SOAP-style notes, there were, on average, 23.6 errors per clinical case, with errors of omission (86%) being the most common, followed by addition errors (10.5%) and inclusion of incorrect facts (3.2%). There was significant variance between replicates of the same case, with only 52.9% of data elements reported correctly across all 3 replicates. The accuracy of data elements varied across cases, with the highest accuracy observed in the "Objective" section. Consequently, the measure of note quality, assessed by PDQI, demonstrated intra- and intercase variance. Finally, the accuracy of ChatGPT-4 was inversely correlated to both the transcript length (P=.05) and the number of scorable data elements (P=.05).
    CONCLUSIONS: Our study reveals substantial variability in errors, accuracy, and note quality generated by ChatGPT-4. Errors were not limited to specific sections, and the inconsistency in error types across replicates complicated predictability. Transcript length and data complexity were inversely correlated with note accuracy, raising concerns about the model\'s effectiveness in handling complex medical cases. The quality and reliability of clinical notes produced by ChatGPT-4 do not meet the standards required for clinical use. Although AI holds promise in health care, caution should be exercised before widespread adoption. Further research is needed to address accuracy, variability, and potential errors. ChatGPT-4, while valuable in various applications, should not be considered a safe alternative to human-generated clinical documentation at this time.
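    The error taxonomy above (omissions, incorrect information, additions) amounts to comparing the data elements found in each generated note against the key reportable elements of the transcript. A minimal sketch of that comparison follows; the plain-string element representation and the rate definitions are simplifying assumptions, not the authors' chart-review protocol.

        # Sketch: scoring a generated SOAP note against gold-standard key elements.
        # Elements are modeled as simple strings; the study relied on manual chart review.

        def score_note(gold_elements: set[str], note_elements: set[str],
                       incorrect_elements: set[str]) -> dict:
            omissions = gold_elements - note_elements     # expected but missing from the note
            additions = note_elements - gold_elements     # in the note but not in the transcript
            correct = (gold_elements & note_elements) - incorrect_elements
            return {
                "omission_count": len(omissions),
                "addition_count": len(additions),
                "incorrect_count": len(incorrect_elements),
                "total_errors": len(omissions) + len(additions) + len(incorrect_elements),
                "accuracy": len(correct) / max(len(gold_elements), 1),
            }

        # Hypothetical example:
        gold = {"chief complaint: chest pain", "bp 140/90", "plan: ecg"}
        note = {"chief complaint: chest pain", "plan: ecg", "plan: chest x-ray"}
        print(score_note(gold, note, incorrect_elements=set()))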

  • Article type: Journal Article
    BACKGROUND: Although patients have easy access to their electronic health records and laboratory test result data through patient portals, laboratory test results are often confusing and hard to understand. Many patients turn to web-based forums or question-and-answer (Q&A) sites to seek advice from their peers. The quality of answers from social Q&A sites on health-related questions varies significantly, and not all responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to have their questions answered.
    OBJECTIVE: We aimed to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and unharmful responses to laboratory test-related questions asked by patients and identify potential issues that can be mitigated using augmentation approaches.
    METHODS: We collected laboratory test result-related Q&A data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and ChatGPT web portal, we generated responses to the 53 questions from 5 LLMs: GPT-4, GPT-3.5, LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard Q&A similarity-based evaluation metrics, including Recall-Oriented Understudy for Gisting Evaluation, Bilingual Evaluation Understudy, Metric for Evaluation of Translation With Explicit Ordering, and Bidirectional Encoder Representations from Transformers Score. We used an LLM-based evaluator to judge whether a target model had higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. We performed a manual evaluation with medical experts for all the responses to 7 selected questions on the same 4 aspects.
    RESULTS: Regarding the similarity of the responses from the 4 LLMs, with the GPT-4 output used as the reference answer, the responses from GPT-3.5 were the most similar, followed by those from LLaMA 2, ORCA_mini, and MedAlpaca. Human answers from Yahoo data were scored the lowest and were thus the least similar to GPT-4-generated answers. The results of the win rate and medical expert evaluation both showed that GPT-4's responses achieved better scores than all the other LLM responses and human responses on all 4 aspects (relevance, correctness, helpfulness, and safety). LLM responses occasionally also suffered from a lack of interpretation in one's medical context, incorrect statements, and a lack of references.
    CONCLUSIONS: By evaluating LLMs in generating responses to patients' laboratory test result-related questions, we found that, compared to the other 4 LLMs and human answers from a Q&A website, GPT-4's responses were more accurate, helpful, relevant, and safer. There were cases in which GPT-4 responses were inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.
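    The similarity metrics named in the methods (ROUGE, BLEU, METEOR, and BERTScore) can all be computed with the Hugging Face evaluate library. A minimal sketch follows, comparing one candidate answer against the GPT-4 output used as the reference; the example strings are invented.

        # Sketch: standard Q&A similarity metrics between a model answer and a reference.
        # Requires: pip install evaluate rouge_score nltk bert_score
        import evaluate

        preds = ["Slightly high ALT is often caused by fatty liver disease; consider retesting."]
        refs = ["A mildly elevated ALT can reflect fatty liver; repeat testing is reasonable."]

        rouge = evaluate.load("rouge")
        bleu = evaluate.load("bleu")
        meteor = evaluate.load("meteor")
        bertscore = evaluate.load("bertscore")

        print(rouge.compute(predictions=preds, references=refs))
        print(bleu.compute(predictions=preds, references=[refs]))  # BLEU takes a list of reference lists
        print(meteor.compute(predictions=preds, references=refs))
        print(bertscore.compute(predictions=preds, references=refs, lang="en"))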

  • Article type: Preprint
    BACKGROUND: Even though patients have easy access to their electronic health records and lab test results data through patient portals, lab results are often confusing and hard to understand. Many patients turn to online forums or question-and-answer (Q&A) sites to seek advice from their peers. However, the quality of answers from social Q&A sites on health-related questions varies significantly, and not all the responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to get their questions answered.
    OBJECTIVE: We aim to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and unharmful responses to lab test-related questions asked by patients and to identify potential issues that can be mitigated with augmentation approaches.
    METHODS: We first collected lab test result-related question-and-answer data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and ChatGPT web portal, we generated responses to the 53 questions from four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard Q&A similarity-based evaluation metrics, including ROUGE, BLEU, METEOR, and BERTScore. We also utilized an LLM-based evaluator to judge whether a target model has higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. Finally, we performed a manual evaluation with medical experts for all the responses to seven selected questions on the same four aspects.
    RESULTS: Regarding the similarity of the responses from the 4 LLMs, with the GPT-4 output used as the reference answer, the responses from LLaMA 2 were the most similar, followed by those from ORCA_mini and MedAlpaca. Human answers from Yahoo data were scored the lowest and were thus the least similar to GPT-4-generated answers. The results of the win rate and medical expert evaluation both showed that GPT-4's responses achieved better scores than all the other LLM responses and human responses on all four aspects (relevance, correctness, helpfulness, and safety). However, LLM responses occasionally also suffered from a lack of interpretation in one's medical context, incorrect statements, and a lack of references.
    CONCLUSIONS: By evaluating LLMs in generating responses to patients' lab test result-related questions, we found that, compared to the other three LLMs and the human answers from the Q&A website, GPT-4's responses were more accurate, helpful, relevant, and safer. However, there were cases in which GPT-4 responses were inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.
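    The LLM-based evaluator described in the methods makes a pairwise judgment: given the question and two candidate answers, a judge model picks the better answer on relevance, correctness, helpfulness, and safety, and the win rate is the fraction of comparisons the target model wins. A minimal sketch under those assumptions follows; the judge prompt and model name are illustrative, not the authors' exact setup.

        # Sketch of an LLM-as-judge pairwise comparison used to compute a win rate.
        # The judge prompt and model name are assumptions for illustration.
        from openai import OpenAI

        client = OpenAI()

        JUDGE_PROMPT = """You are judging two answers to a patient's lab-test question.
        Question: {question}
        Answer A: {answer_a}
        Answer B: {answer_b}
        Considering relevance, correctness, helpfulness, and safety, reply with exactly
        one word: A if Answer A is better, B if Answer B is better, or Tie."""

        def judge(question: str, answer_a: str, answer_b: str) -> str:
            resp = client.chat.completions.create(
                model="gpt-4o",  # any capable judge model
                temperature=0,
                messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                    question=question, answer_a=answer_a, answer_b=answer_b)}],
            )
            return resp.choices[0].message.content.strip()

        def win_rate(questions, target_answers, baseline_answers) -> float:
            wins = sum(judge(q, t, b) == "A"
                       for q, t, b in zip(questions, target_answers, baseline_answers))
            return wins / len(questions)

    In practice each pair is usually judged twice with the answer order swapped, to reduce the judge's position bias.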

  • Article type: Journal Article
    Incorporating generative artificial intelligence (GAI) in education has become crucial in contemporary educational environments. This research article thoroughly investigates the ramifications of implementing GAI in the higher education context of Saudi Arabia, employing a blend of quantitative and qualitative research approaches. Survey-based quantitative data reveals a noteworthy correlation between educators' awareness of GAI and the frequency of its application. Notably, around half of the surveyed educators are at stages characterized by understanding and familiarity with GAI integration, indicating a tangible readiness for its adoption. Moreover, the study's quantitative findings underscore the perceived value and ease associated with integrating GAI, thus reinforcing the assumption that educators are motivated and inclined to integrate GAI tools like ChatGPT into their teaching methodologies. In addition to the quantitative analysis, qualitative insights from in-depth interviews with educators unveil a rich tapestry of perspectives. The qualitative data emphasizes GAI's role as a catalyst for collaborative learning, contributing to professional development, and fostering innovative teaching practices.

  • Article type: Journal Article
    OBJECTIVE: This study explores the potential of a generative artificial intelligence tool (ChatGPT) as clinical support for nurses. Specifically, we aim to assess whether ChatGPT can demonstrate clinical decision-making equivalent to that of expert nurses and novice nursing students. This will be evaluated by comparing ChatGPT responses to clinical scenarios to those of nurses on different levels of experience.
    DESIGN: This is a cross-sectional study.
    METHODS: Emergency room registered nurses (i.e. experts; n = 30) and nursing students (i.e. novices; n = 38) were recruited during March-April 2023. Clinical decision-making was measured using three validated clinical scenarios involving an initial assessment and reevaluation. Clinical decision-making aspects assessed were the accuracy of initial assessments, the appropriateness of recommended tests and resource use and the capacity to reevaluate decisions. Performance was also compared by timing response generations and word counts. Expert nurses and novice students completed online questionnaires (via Qualtrics), while ChatGPT responses were obtained from OpenAI.
    RESULTS: Concerning aspects of clinical decision-making and compared to novices and experts: (1) ChatGPT exhibited indecisiveness in initial assessments; (2) ChatGPT tended to suggest unnecessary diagnostic tests; (3) when new information required re-evaluation, ChatGPT responses demonstrated inaccurate understanding and inappropriate modifications. In terms of performance, the mean number of words used in ChatGPT answers was 27-41 times greater than that used by both experts and novices, and responses were provided approximately 4 times faster than those of novices and about twice as fast as those of expert nurses. ChatGPT responses maintained logical structure and clarity.
    CONCLUSIONS: A generative AI tool demonstrated indecisiveness and a tendency towards over-triage compared to human clinicians.
    CONCLUSIONS: The study shows that it is important to approach the implementation of ChatGPT as a nurse's digital assistant with caution. More study is needed to optimize the model's training and algorithms to provide accurate healthcare support that aids clinical decision-making.
    This study adhered to relevant EQUATOR guidelines for reporting observational studies.
    Patients were not directly involved in the conduct of this study.

  • Article type: Journal Article
    OBJECTIVE: To assess the impact of the use of an ambient listening/digital scribing solution (Nuance Dragon Ambient eXperience (DAX)) on caregiver engagement, time spent on Electronic Health Record (EHR) including time after hours, productivity, attributed panel size for value-based care providers, documentation timeliness, and Current Procedural Terminology (CPT) submissions.
    METHODS: We performed a peer-matched controlled cohort study from March to September 2022 to evaluate the impact of DAX in outpatient clinics in an integrated healthcare system. Primary outcome measurements included provider engagement survey results, reported patient safety events related to DAX use, patients' Likelihood to Recommend score, number of patients opting out of ambient listening, change in work relative value units, attributed value-based primary care panel size, documentation completion and CPT code submission deficiency rates, and note turnaround time.
    RESULTS: A total of 99 providers representing 12 specialties enrolled in the study; 76 matched control group providers were included for analysis. Median utilization of DAX was 47% among active participants. We found positive trends in provider engagement among participants, while non-participants saw worsening engagement and no practical change in productivity. There was a statistically significant worsening of after-hours EHR time. There was no quantifiable effect on patient safety.
    CONCLUSIONS: Nuance DAX use showed positive trends in provider engagement at no risk to patient safety, experience, or clinical documentation. There were no significant benefits to patient experience, documentation, or measures of provider productivity.
    CONCLUSIONS: Our results highlight the potential of ambient dictation as a tool for improving the provider experience. Head-to-head comparisons of EHR documentation efficiency training are needed.

  • Article type: Journal Article
    BACKGROUND: Artificial intelligence (AI) is a rapidly developing field with the potential to transform various aspects of health care and public health, including medical training. During the "Hygiene and Public Health" course for fifth-year medical students, a practical training session was conducted on vaccination using AI chatbots as an educational supportive tool. Before receiving specific training on vaccination, the students were given a web-based test extracted from the Italian National Medical Residency Test. After completing the test, a critical correction of each question was performed assisted by AI chatbots.
    OBJECTIVE: The main aim of this study was to identify whether AI chatbots can be considered educational support tools for training in public health. The secondary objective was to assess the performance of different AI chatbots on complex multiple-choice medical questions in the Italian language.
    METHODS: A test composed of 15 multiple-choice questions on vaccination was extracted from the Italian National Medical Residency Test using targeted keywords and administered to medical students via Google Forms and to different AI chatbot models (Bing Chat, ChatGPT, Chatsonic, Google Bard, and YouChat). The correction of the test was conducted in the classroom, focusing on the critical evaluation of the explanations provided by the chatbot. A Mann-Whitney U test was conducted to compare the performances of medical students and AI chatbots. Student feedback was collected anonymously at the end of the training experience.
    RESULTS: In total, 36 medical students and 5 AI chatbot models completed the test. The students achieved an average score of 8.22 (SD 2.65) out of 15, while the AI chatbots scored an average of 12.22 (SD 2.77). The results indicated a statistically significant difference in performance between the 2 groups (U=49.5, P<.001), with a large effect size (r=0.69). When divided by question type (direct, scenario-based, and negative), significant differences were observed in direct (P<.001) and scenario-based (P<.001) questions, but not in negative questions (P=.48). The students reported a high level of satisfaction (7.9/10) with the educational experience, expressing a strong desire to repeat the experience (7.6/10).
    CONCLUSIONS: This study demonstrated the efficacy of AI chatbots in answering complex medical questions related to vaccination and providing valuable educational support. Their performance significantly surpassed that of medical students in direct and scenario-based questions. The responsible and critical use of AI chatbots can enhance medical education, making it an essential aspect to integrate into the educational system.
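    The Mann-Whitney U comparison and the effect size r reported above can be reproduced with SciPy, converting U to a Z score via the normal approximation and taking r = Z/√N; a minimal sketch with made-up score vectors follows.

        # Sketch: Mann-Whitney U test and effect size r = Z / sqrt(N) for two
        # independent groups of test scores (the score vectors below are made up).
        import math
        from scipy.stats import mannwhitneyu

        student_scores = [6, 7, 8, 8, 9, 10, 11, 5, 7, 9]   # hypothetical scores out of 15
        chatbot_scores = [11, 12, 13, 14, 15]               # hypothetical scores out of 15

        u_stat, p_value = mannwhitneyu(student_scores, chatbot_scores, alternative="two-sided")

        # Normal approximation: convert U to a Z score, then to the effect size r.
        n1, n2 = len(student_scores), len(chatbot_scores)
        mu_u = n1 * n2 / 2
        sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
        z = (u_stat - mu_u) / sigma_u
        r = abs(z) / math.sqrt(n1 + n2)

        print(f"U={u_stat:.1f}, p={p_value:.4f}, r={r:.2f}")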