Google Bard

  • Article type: Journal Article
    OBJECTIVE: Large language models (LLMs) are a form of artificial intelligence (AI) that uses deep learning techniques to understand, summarize and generate content. The potential benefits of LLMs in healthcare are predicted to be immense. The objective of this study was to examine the quality of patient information leaflets (PILs) produced by 3 LLMs on urological topics.
    METHODS: Prompts were created to generate PILs from 3 LLMs: ChatGPT-4, PaLM 2 (Google Bard) and Llama 2 (Meta) across four urology topics (circumcision, nephrectomy, overactive bladder syndrome, and transurethral resection of the prostate). PILs were evaluated using a quality assessment checklist. PIL readability was assessed by the Average Reading Level Consensus Calculator.
    RESULTS: PILs generated by PaLM 2 had the highest overall average quality score (3.58), followed by Llama 2 (3.34) and ChatGPT-4 (3.08). PaLM 2-generated PILs were of the highest quality in all topics except TURP, and PaLM 2 was the only LLM to include images. Medical inaccuracies were present in all generated content, including instances of significant error. Readability analysis identified PaLM 2-generated PILs as the simplest (age 14-15 average reading level). Llama 2 PILs were the most difficult (age 16-17 average).
    CONCLUSIONS: While LLMs can generate PILs that may help reduce healthcare professional workload, generated content requires clinician input to ensure accuracy and the inclusion of health literacy aids, such as images. LLM-generated PILs were above the average reading level for adults, necessitating improvement in LLM algorithms and/or prompt design. Patient satisfaction with LLM-generated PILs remains to be evaluated.

  • Article type: Journal Article
    BACKGROUND: The emerging rise of novel computer technologies and automated data analytics has the potential to change the course of dental education. In line with our long-term goal of harnessing the power of AI to augment didactic teaching, the objective of this study was to quantify the accuracy of responses provided by ChatGPT (GPT-4 and GPT-3.5) and Google Gemini, three primary large language models (LLMs), to the annual in-service examination questions posed by the American Academy of Periodontology (AAP), and to compare it with that of human graduate students (control group).
    METHODS: Under a comparative cross-sectional study design, a corpus of 1312 questions from the annual in-service examination of the AAP administered between 2020 and 2023 was presented to the LLMs. Their responses were analyzed using chi-square tests, and their performance was juxtaposed with the scores of periodontal residents from the corresponding years, who served as the human control group. Additionally, two sub-analyses were performed: one on the performance of the LLMs on each section of the exam, and another on their performance in answering the most difficult questions.
    RESULTS: ChatGPT-4 (total average: 79.57%) outperformed all human control groups as well as GPT-3.5 and Google Gemini in all exam years (p < .001). This chatbot showed an accuracy range between 78.80% and 80.98% across the various exam years. Gemini consistently recorded superior performance, with scores of 70.65% (p = .01), 73.29% (p = .02), 75.73% (p < .01), and 72.18% (p = .0008) for the exams from 2020 to 2023, compared to ChatGPT-3.5, which achieved 62.5%, 68.24%, 69.83%, and 59.27%, respectively. Google Gemini (72.86%) surpassed the average scores achieved by first-year (63.48% ± 31.67) and second-year residents (66.25% ± 31.61) when all exam years were combined. However, it could not surpass that of third-year residents (69.06% ± 30.45).
    CONCLUSIONS: Within the confines of this analysis, ChatGPT-4 exhibited a robust capability in answering AAP in-service exam questions in terms of accuracy and reliability, while Gemini and ChatGPT-3.5 showed a weaker performance. These findings underscore the potential of deploying LLMs as an educational tool in the periodontics and oral implantology domains. However, the current limitations of these models, such as the inability to effectively process image-based inquiries, the propensity for generating inconsistent responses to the same prompts, and accuracy rates that are high (80% for GPT-4) but not absolute, should be considered. An objective comparison of their capability versus their capacity is required to further develop this field of study.
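    The chi-square comparison of answer accuracy described in the methods can be illustrated with a minimal sketch. All counts below are hypothetical placeholders rather than data from the study; the sketch simply contrasts correct versus incorrect answer counts for two responders on the same question set.

    ```python
    # Minimal sketch of a chi-square comparison of exam accuracy between
    # two responders (e.g., an LLM and a resident cohort).
    # All counts are hypothetical, not taken from the AAP study.
    from scipy.stats import chi2_contingency

    n_questions = 328          # assumed number of questions in one exam year
    llm_correct = 261          # hypothetical number answered correctly by the LLM
    resident_correct = 226     # hypothetical number answered correctly by residents

    # 2x2 contingency table: rows = responder, columns = correct / incorrect
    table = [
        [llm_correct, n_questions - llm_correct],
        [resident_correct, n_questions - resident_correct],
    ]

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
    ```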

  • Article type: Journal Article
    OBJECTIVE: This study compared three artificial intelligence (AI) platforms' potential to identify drug therapy communication competencies expected of a graduating medical doctor.
    METHODS: We presented three AI platforms, namely, Poe Assistant©, ChatGPT© and Google Bard©, with structured queries to generate communication skill competencies and case scenarios appropriate for graduating medical doctors. These case scenarios comprised 15 prototypical medical conditions that required drug prescriptions. Two authors independently evaluated the AI-enhanced clinical encounters, which integrated a diverse range of information to create patient-centred care plans. Through a consensus-based approach using a checklist, the communication components generated for each scenario were assessed. The instructions and warnings provided for each case scenario were evaluated by referencing the British National Formulary.
    RESULTS: AI platforms demonstrated overlap in the competency domains generated, albeit with variations in wording. The domains of knowledge (basic and clinical pharmacology, prescribing, communication and drug safety) were unanimously recognized by all platforms. A broad consensus between Poe Assistant© and ChatGPT© on drug therapy-related communication issues specific to each case scenario was evident. The consensus primarily encompassed salutation, generic drug prescribed, treatment goals and follow-up schedules. Differences were observed in patient instruction clarity, listed side effects, warnings and patient empowerment. Google Bard did not provide guidance on patient communication issues.
    CONCLUSIONS: AI platforms recognized competencies with variations in how these were stated. Poe Assistant© and ChatGPT© exhibited alignment of communication issues. However, significant discrepancies were observed in specific skill components, indicating the necessity of human intervention to critically evaluate AI-generated outputs.

  • Article type: Journal Article
    BACKGROUND: In recent years, the integration of artificial intelligence (AI) into various fields of medicine, including gynaecology, has shown promising potential. The surgical treatment of fibroid is myomectomy if uterine preservation and fertility are the primary aims. AI usage begins with the involvement of a large language model (LLM) from the point when a patient visits a gynecologist, from identifying signs and symptoms to reaching a diagnosis, providing treatment plans, and patient counseling.
    OBJECTIVE: Use of AI (ChatGPT versus Google Bard) in the surgical management of fibroid.
    METHODS: Identifying the patient's problems using LLMs like ChatGPT and Google Bard and giving a treatment option in 8 clinical scenarios of fibroid. Data entry was done using M.S. Excel, and data were statistically analyzed using the Statistical Package for Social Sciences (SPSS Version 26) for M.S. Windows 2010. All results were presented in tabular form. Data were analyzed using nonparametric tests (Chi-square test or Fisher's exact test); p values < 0.05 were considered statistically significant. The sensitivity of both techniques was calculated. We used Cohen's Kappa to assess the degree of agreement.
    RESULTS: We found that on the first attempt, ChatGPT gave general answers in 62.5% of cases and specific answers in 37.5% of cases. ChatGPT's sensitivity improved with successive prompts, from 37.5% to 62.5% by the third prompt. Google Bard could not identify the clinical question in 50% of cases and gave incorrect answers in 12.5% of cases (p = 0.04). Google Bard showed the same sensitivity of 25% on all prompts.
    CONCLUSIONS: AI helps to reduce the time needed to diagnose and plan a treatment strategy for fibroid and acts as a powerful tool in the hands of a gynecologist. However, the use of AI by patients for self-treatment should be avoided; AI should be used only for education and counseling about fibroids.
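    As a rough illustration of the statistics named in the methods (Fisher's exact test for comparing response categories and Cohen's Kappa for rater agreement), the sketch below uses invented counts and labels; it is not the authors' analysis code.

    ```python
    # Hypothetical sketch of the tests named in the methods: Fisher's exact
    # test on a 2x2 table and Cohen's Kappa for agreement between two raters.
    # All numbers are invented placeholders.
    from scipy.stats import fisher_exact
    from sklearn.metrics import cohen_kappa_score

    # 2x2 table: rows = chatbot, columns = specific / non-specific answers
    # across 8 clinical scenarios (hypothetical counts)
    table = [[5, 3],   # ChatGPT
             [2, 6]]   # Google Bard
    odds_ratio, p_value = fisher_exact(table)
    print(f"Fisher exact: OR = {odds_ratio:.2f}, p = {p_value:.3f}")

    # Agreement between two raters scoring the same 8 responses
    # as "specific" (1) or "general" (0) (hypothetical labels)
    rater_a = [1, 0, 1, 1, 0, 1, 0, 1]
    rater_b = [1, 0, 1, 0, 0, 1, 0, 1]
    kappa = cohen_kappa_score(rater_a, rater_b)
    print(f"Cohen's kappa = {kappa:.2f}")
    ```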

  • Article type: Journal Article
    Migraine is a frequent and highly disabling disorder, and education of people with migraine must be strengthened to ease this global burden. The rapidly evolving field of large language models (LLMs) offers a promising avenue for migraine patient education. This study aimed to evaluate the potential of LLMs in this context by assessing the accuracy of responses from five leading LLMs, OpenAI's ChatGPT 3.5 and 4.0, Google Bard, Meta Llama 2 and Anthropic Claude 2, to 30 common migraine-related queries. We found that the LLMs showed varying levels of accuracy. ChatGPT-4.0 provided appropriate responses to 96.7% of the queries, whereas the other chatbots provided appropriate responses to 83.3% to 90% (Pearson's chi-square test, P = 0.481). In addition, the rating proportion was 6.7% for Google Bard and 3.3% for the other LLMs (Pearson's chi-square test, P = 0.961). This study highlights the potential of LLMs to accurately address common migraine-related queries. Such findings may promote AI-assisted education for migraine patients and offer insights for a holistic approach to migraine management.
    This study assessed the potential of large language models (OpenAI's ChatGPT 3.5 and 4.0, Google Bard, Meta Llama2, and Anthropic Claude2) in addressing 30 common migraine-related queries, providing a foundation to advance artificial intelligence-assisted patient education and insights for a holistic approach to migraine management.

  • Article type: Journal Article
    OBJECTIVE: Searching for online health information is a popular approach employed by patients to enhance their knowledge of their diseases. Recently developed AI chatbots are probably the easiest way to do so. The purpose of this study was to analyze the reliability and readability of AI chatbot responses concerning the radionuclide treatments most commonly applied in cancer patients.
    METHODS: Basic patient questions, thirty about RAI, PRRT and TARE treatments and twenty-nine about PSMA-TRT, were posed one by one to GPT-4 and Bard in January 2024. The reliability and readability of the responses were assessed using the DISCERN scale, the Flesch Reading Ease (FRE) and the Flesch-Kincaid Reading Grade Level (FKRGL).
    RESULTS: The mean (SD) FKRGL scores for the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 14.57 (1.19), 14.65 (1.38), 14.25 (1.10), 14.38 (1.2) and 11.49 (1.59), 12.42 (1.71), 11.35 (1.80), 13.01 (1.97), respectively. In terms of readability, the FKRGL scores of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatments were above the general public reading grade level. The mean (SD) DISCERN scores assessed by the nuclear medicine physician for the responses of GPT-4 and Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 47.86 (5.09), 48.48 (4.22), 46.76 (4.09), 48.33 (5.15) and 51.50 (5.64), 53.44 (5.42), 53 (6.36), 49.43 (5.32), respectively. Based on mean DISCERN scores, the reliability of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT, and TARE treatments ranged from fair to good. The inter-rater reliability correlation coefficients of the DISCERN scores assessed by GPT-4, Bard and the nuclear medicine physician for the responses of GPT-4 about RAI, PSMA-TRT, PRRT and TARE treatments were 0.512 (95% CI 0.296-0.704), 0.695 (95% CI 0.518-0.829), 0.687 (95% CI 0.511-0.823) and 0.649 (95% CI 0.462-0.798), respectively (p < 0.01). The inter-rater reliability correlation coefficients of the DISCERN scores assessed by GPT-4, Bard and the nuclear medicine physician for the responses of Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 0.753 (95% CI 0.602-0.863), 0.812 (95% CI 0.686-0.899), 0.804 (95% CI 0.677-0.894) and 0.671 (95% CI 0.489-0.812), respectively (p < 0.01). The inter-rater reliability for the responses of Bard and GPT-4 about RAI, PSMA-TRT, PRRT and TARE treatments was moderate to good. Further, consulting a nuclear medicine physician was rarely emphasized by either GPT-4 or Google Bard, and references were included in some Google Bard responses but in none of the GPT-4 responses.
    CONCLUSIONS: Although the information provided by AI chatbots may be acceptable in medical terms, it is not easy for the general public to read, which may prevent it from being understood. Effective prompts using 'prompt engineering' may refine the responses into a more comprehensible form. Since radionuclide treatments are specific to nuclear medicine expertise, the nuclear medicine physician needs to be named as a consultant in the responses in order to guide patients and caregivers towards accurate medical advice. Referencing is important for the confidence and satisfaction of patients and caregivers seeking information.
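    For readers unfamiliar with the readability indices above, the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas can be computed as in the sketch below; the syllable counter is a crude heuristic for illustration only, and the study itself presumably relied on established calculators.

    ```python
    # Standard Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level
    # (FKGL/FKRGL) formulas. The syllable counter is a crude vowel-group
    # heuristic used only for illustration.
    import re

    def count_syllables(word: str) -> int:
        # Approximate syllables as groups of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def readability(text: str):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        n_words = max(1, len(words))
        n_syllables = sum(count_syllables(w) for w in words)

        fre = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syllables / n_words)
        fkgl = 0.39 * (n_words / sentences) + 11.8 * (n_syllables / n_words) - 15.59
        return fre, fkgl

    fre, fkgl = readability("Radioactive iodine therapy is used to treat some thyroid diseases.")
    print(f"FRE = {fre:.1f}, FKGL = {fkgl:.1f}")
    ```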

  • Article type: Journal Article
    UNASSIGNED: The utility of large language model-based (LLM) artificial intelligence (AI) chatbots in many aspects of healthcare is becoming apparent, though their ability to address patient concerns remains unknown. We sought to evaluate the performance of two well-known, freely accessible chatbots, ChatGPT and Google Bard, in responding to common questions about stroke rehabilitation posed by patients and their caregivers.
    UNASSIGNED: We collected questions from outpatients and their caregivers through a survey, categorised them by theme, and created representative questions to be posed to both chatbots. We then evaluated the chatbots' responses based on accuracy, safety, relevance, and readability. Interrater agreement was also tracked.
    UNASSIGNED: Although both chatbots achieved similar overall scores, Google Bard performed slightly better in relevance and safety. Both provided readable responses with some general accuracy, but they struggled with hallucinated responses, were often not specific, and lacked awareness that emotional situations could potentially turn dangerous. Additionally, interrater agreement was low, highlighting the variability in physician acceptance of their responses.
    UNASSIGNED: AI chatbots show potential in patient-facing support roles, but issues remain regarding safety, accuracy, and relevance. Future chatbots should address these problems to ensure that they can reliably and independently manage the concerns and questions of stroke patients and their caregivers.

  • Article type: Journal Article
    UNASSIGNED: Patients are increasingly using artificial intelligence (AI) chatbots to seek answers to medical queries.
    UNASSIGNED: Ten frequently asked questions in anaesthesia were posed to three AI chatbots: ChatGPT4 (OpenAI), Bard (Google), and Bing Chat (Microsoft). Each chatbot's answers were evaluated in a randomised, blinded order by five residency programme directors from 15 medical institutions in the USA. Three medical content quality categories (accuracy, comprehensiveness, safety) and three communication quality categories (understandability, empathy/respect, and ethics) were scored between 1 and 5 (1 representing worst, 5 representing best).
    UNASSIGNED: ChatGPT4 and Bard outperformed Bing Chat (median [inter-quartile range] scores: 4 [3-4], 4 [3-4], and 3 [2-4], respectively; P<0.001 with all metrics combined). All AI chatbots performed poorly in accuracy (score of ≥4 by 58%, 48%, and 36% of experts for ChatGPT4, Bard, and Bing Chat, respectively), comprehensiveness (score ≥4 by 42%, 30%, and 12% of experts for ChatGPT4, Bard, and Bing Chat, respectively), and safety (score ≥4 by 50%, 40%, and 28% of experts for ChatGPT4, Bard, and Bing Chat, respectively). Notably, answers from ChatGPT4, Bard, and Bing Chat differed statistically in comprehensiveness (ChatGPT4, 3 [2-4] vs Bing Chat, 2 [2-3], P<0.001; and Bard 3 [2-4] vs Bing Chat, 2 [2-3], P=0.002). All large language model chatbots performed well with no statistical difference for understandability (P=0.24), empathy (P=0.032), and ethics (P=0.465).
    UNASSIGNED: In answering anaesthesia patient frequently asked questions, the chatbots perform well on communication metrics but are suboptimal for medical content metrics. Overall, ChatGPT4 and Bard were comparable to each other, both outperforming Bing Chat.

  • Article type: Journal Article
    OBJECTIVE: To evaluate the role of artificial intelligence (Google Bard) in figures, scans, and image identifications and interpretations in medical education and healthcare sciences through an Objective Structured Practical Examination (OSPE) type of performance.
    METHODS: An OSPE-type question bank was created from a pool of medical sciences figures, scans, and images. For assessment, 60 figures, scans and images were selected and entered into the input area of Google Bard to evaluate its knowledge level.
    RESULTS: The marks obtained by Google Bard were: brain structures, morphological and radiological images, 7/10 (70%); bone structures, radiological images, 9/10 (90%); liver structure, morphological and pathological images, 4/10 (40%); kidney structure and morphological images, 2/7 (28.57%); neuro-radiological images, 4/7 (57.14%); and endocrine glands including the thyroid, pancreas and breast, morphological and radiological images, 8/16 (50%). The overall total mark obtained by Google Bard across the various OSPE figure, scan, and image identification questions was 34/60 (56.7%).
    CONCLUSIONS: Google Bard scored satisfactorily in morphological, histopathological, and radiological image identifications and their interpretations. Google Bard may assist medical students, faculty in medical education and physicians in healthcare settings.

  • Article type: Journal Article
    UNASSIGNED: Generative AI is revolutionizing patient education in healthcare, particularly through chatbots that offer personalized, clear medical information. Reliability and accuracy are vital in AI-driven patient education.
    UNASSIGNED: How effective are large language models (LLMs), such as ChatGPT and Google Bard, in delivering accurate and understandable patient education on lumbar disc herniation?
    UNASSIGNED: Ten Frequently Asked Questions about lumbar disc herniation were selected from 133 questions and were submitted to three LLMs. Six experienced spine surgeons rated the responses on a scale from "excellent" to "unsatisfactory" and evaluated the answers for exhaustiveness, clarity, empathy, and length. Statistical analysis involved Fleiss Kappa, Chi-square, and Friedman tests.
    UNASSIGNED: Out of the responses, 27.2% were excellent, 43.9% satisfactory with minimal clarification, 18.3% satisfactory with moderate clarification, and 10.6% unsatisfactory. There were no significant differences in overall ratings among the LLMs (p = 0.90); however, inter-rater reliability was not achieved, and large differences among raters were detected in the distribution of answer frequencies. Overall, ratings varied among the 10 answers (p = 0.043). The average ratings for exhaustiveness, clarity, empathy, and length were above 3.5/5.
    UNASSIGNED: LLMs show potential in patient education for lumbar spine surgery, with generally positive feedback from evaluators. The new EU AI Act, enforcing strict regulation on AI systems, highlights the need for rigorous oversight in medical contexts. In the current study, the variability in evaluations and occasional inaccuracies underline the need for continuous improvement. Future research should involve more advanced models to enhance patient-physician communication.
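    Agreement among the six raters was assessed with Fleiss Kappa; a minimal sketch of such a computation, using invented ratings and the statsmodels implementation rather than the study's data or code, is shown below.

    ```python
    # Sketch of a Fleiss' Kappa computation for six raters scoring the same
    # set of answers on a 4-point scale (ratings are invented, not the
    # study's data).
    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # rows = answers, columns = raters; values = rating category (0-3)
    ratings = np.array([
        [3, 3, 2, 3, 3, 2],
        [2, 2, 2, 1, 2, 2],
        [3, 2, 3, 3, 3, 3],
        [1, 0, 1, 1, 0, 1],
        [2, 3, 2, 2, 3, 2],
    ])

    # Convert to an answers x categories count table, then compute kappa.
    table, _categories = aggregate_raters(ratings)
    kappa = fleiss_kappa(table, method="fleiss")
    print(f"Fleiss' kappa = {kappa:.2f}")
    ```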