Background: Accurate medical advice is paramount in ensuring optimal patient care, and misinformation can lead to misguided decisions with potentially detrimental health outcomes. The emergence of large language models (LLMs) such as OpenAI's GPT-4 has spurred interest in their potential health care applications, particularly in automated medical consultation. Yet, rigorous investigations comparing their performance to that of human experts remain sparse.
Objective: This study aimed to compare the medical accuracy of GPT-4 with that of human experts in providing medical advice using real-world user-generated queries, with a specific focus on cardiology. It also sought to analyze the performance of GPT-4 and human experts in specific question categories, including drug or medication information and preliminary diagnoses.
Methods: We collected 251 pairs of cardiology-specific questions from general users and answers from human experts via an internet portal. GPT-4 was tasked with generating responses to the same questions. Three independent cardiologists (SL, JHK, and JJC) evaluated the answers provided by both human experts and GPT-4. Using a computer interface, each evaluator compared the pairs and determined which answer was superior, and they quantitatively measured the clarity and complexity of the questions as well as the accuracy and appropriateness of the responses, applying a 3-tiered grading scale (low, medium, and high). Furthermore, a linguistic analysis was conducted to compare the length and vocabulary diversity of the responses using word count and type-token ratio.
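The two linguistic metrics named above are straightforward to compute. The sketch below, a minimal illustration rather than the authors' actual pipeline (whose tokenization rules are not specified in the abstract), counts sentences by terminal punctuation and computes the type-token ratio as unique word tokens divided by total word tokens:

```python
import re

def type_token_ratio(text: str) -> float:
    """Vocabulary diversity: unique word tokens / total word tokens.

    Assumes simple lowercase word tokenization; the study's exact
    tokenizer is not described in the abstract.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def sentence_count(text: str) -> int:
    # Rough split on terminal punctuation (. ! ?); counts non-empty segments.
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

# Hypothetical answer text for illustration only
answer = "Aspirin thins the blood. Aspirin may upset the stomach."
print(sentence_count(answer))                  # 2
print(round(type_token_ratio(answer), 2))      # 0.78 (7 unique / 9 total)
```

A lower type-token ratio, as reported for GPT-4's responses, indicates more word repetition relative to total length, which is one reason longer answers tend to score lower on this metric.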
Results: GPT-4 and human experts displayed comparable efficacy in medical accuracy ("GPT-4 is better" at 132/251, 52.6% vs "Human expert is better" at 119/251, 47.4%). In accuracy level categorization, humans had more high-accuracy responses than GPT-4 (50/237, 21.1% vs 30/238, 12.6%) but also a greater proportion of low-accuracy responses (11/237, 4.6% vs 1/238, 0.4%; P=.001). GPT-4 responses were generally longer and used a less diverse vocabulary than those of human experts, potentially enhancing their comprehensibility for general users (sentence count: mean 10.9, SD 4.2 vs mean 5.9, SD 3.7; P<.001; type-token ratio: mean 0.69, SD 0.07 vs mean 0.79, SD 0.09; P<.001). Nevertheless, human experts outperformed GPT-4 in specific question categories, notably those related to drug or medication information and preliminary diagnoses. These findings highlight the limitations of GPT-4 in providing advice based on clinical experience.
Conclusions: GPT-4 has shown promising potential in automated medical consultation, with medical accuracy comparable to that of human experts. However, challenges remain, particularly in the realm of nuanced clinical judgment. Future improvements in LLMs may require the integration of specific clinical reasoning pathways and regulatory oversight for safe use. Further research is needed to understand the full potential of LLMs across various medical specialties and conditions.