BACKGROUND: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement.
OBJECTIVE: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in health care decision-making while comparing seniors' and residents' ratings, as well as ratings across specific question types.
METHODS: A total of 4 specialized physicians formulated 176 real-world clinical questions. A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across classifications.
RESULTS: Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 higher than residents did on utility and comprehensiveness (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), with a similar pattern for GPT-3.5 (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and comprehensiveness criteria. Distinctions among question types were significant, particularly for GPT-4's mean comprehensiveness scores across emergency, internal medicine, and ethical questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5's accuracy, utility, and comprehensiveness dimensions.
CONCLUSIONS: ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostic, therapeutic, and ethical decision-making. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.