Keywords: ChatGPT; ChatGPT-3.5; ChatGPT-4.0; LLM; accuracy; app; application; diagnose; diagnosis; emergency; emergency patient; language model; machine learning; self-diagnose; self-diagnosis; symptom checker; triage

MeSH: Humans; Triage/methods; Emergency Service, Hospital; Physicians; Health Personnel; Self Report

Source: DOI: 10.2196/49995 | PDF (PubMed)

Abstract:
Background: Diagnosis is a core component of effective health care, but misdiagnosis is common and can put patients at risk. Diagnostic decision support systems can play a role in improving diagnosis by physicians and other health care workers. Symptom checkers (SCs) have been designed to improve diagnosis and triage (ie, which level of care to seek) by patients.
Objective: The aim of this study was to evaluate the performance of the new large language model ChatGPT (versions 3.5 and 4.0), the widely used WebMD SC, and an SC developed by Ada Health in the diagnosis and triage of patients with urgent or emergent clinical problems, compared with the final emergency department (ED) diagnoses and physician reviews.
Methods: We used previously collected, deidentified self-report data from 40 patients who presented to an ED for care and who used the Ada SC to record their symptoms before seeing the ED physician. A research assistant blinded to the diagnoses and triage outcomes entered the deidentified data into ChatGPT versions 3.5 and 4.0 and into WebMD. Diagnoses from all 4 systems were compared with the previously abstracted final ED diagnoses, as well as with the diagnoses and triage recommendations of 3 independent board-certified ED physicians who had blindly reviewed the self-reported clinical data from Ada. Diagnostic accuracy was calculated as the proportion of diagnoses from ChatGPT, the Ada SC, the WebMD SC, and the independent physicians that matched at least one ED diagnosis (stratified as top 1 or top 3). Triage accuracy was calculated as the number of recommendations from ChatGPT, WebMD, or Ada that agreed with at least 2 of the independent physicians, or that were rated "unsafe" or "too cautious."
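A minimal sketch of the top-k diagnostic-match metric described above, in Python. The data structures and the exact string comparison are illustrative assumptions only; in the study, matches were judged against the abstracted ED diagnoses by reviewers, not by string equality.

def top_k_match(system_diagnoses, ed_diagnoses, k):
    """True if any of the system's k highest-ranked diagnoses matches an ED diagnosis."""
    return any(d in ed_diagnoses for d in system_diagnoses[:k])

def diagnostic_accuracy(cases, k):
    """Proportion of cases whose top-k list matches at least one ED diagnosis."""
    return sum(top_k_match(c["system"], c["ed"], k) for c in cases) / len(cases)

# Hypothetical case: the second-ranked diagnosis matches the ED diagnosis,
# so it counts as a top-3 match but not as a top-1 match.
cases = [{"system": ["gastritis", "appendicitis", "cholecystitis"],
          "ed": {"appendicitis"}}]
print(diagnostic_accuracy(cases, k=1))  # 0.0
print(diagnostic_accuracy(cases, k=3))  # 1.0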
Results: Overall, 30 and 37 cases had sufficient data for the diagnostic and triage analyses, respectively. The top-1 diagnostic match rates for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD were 9 (30%), 12 (40%), 10 (33%), and 12 (40%), respectively, with a mean rate of 47% for the physicians. The top-3 diagnostic match rates for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD were 19 (63%), 19 (63%), 15 (50%), and 17 (57%), respectively, with a mean rate of 69% for the physicians. The distribution of triage results for Ada was 62% (n=23) agree, 14% (n=5) unsafe, and 24% (n=9) too cautious; for ChatGPT 3.5, it was 59% (n=22) agree, 41% (n=15) unsafe, and 0% (n=0) too cautious; for ChatGPT 4.0, it was 76% (n=28) agree, 22% (n=8) unsafe, and 3% (n=1) too cautious; and for WebMD, it was 70% (n=26) agree, 19% (n=7) unsafe, and 11% (n=4) too cautious. The unsafe triage rate for ChatGPT 3.5 (41%) was significantly higher than that for Ada (14%; P=.009).
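The abstract does not name the statistical test behind P=.009. As an illustrative check only, a two-sided Fisher exact test on the unpaired 2x2 counts (15/37 unsafe for ChatGPT 3.5 vs 5/37 for Ada) can be run as below. Because both systems rated the same 37 cases, the authors may instead have used a paired test such as McNemar's, so this sketch need not reproduce the published P value.

from scipy.stats import fisher_exact

# Unsafe vs not-unsafe triage counts (n=37 cases per system).
table = [[15, 22],   # ChatGPT 3.5: unsafe, not unsafe
         [5, 32]]    # Ada: unsafe, not unsafe
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")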
Conclusions: ChatGPT 3.5 had high diagnostic accuracy but a high unsafe triage rate. ChatGPT 4.0 had the poorest diagnostic accuracy but a lower unsafe triage rate and the highest triage agreement with the physicians. The Ada and WebMD SCs performed better overall than ChatGPT. Unsupervised patient use of ChatGPT for diagnosis and triage is not recommended without improvements in triage accuracy and extensive clinical evaluation.