Keywords: GPT-4; MIMIC-IV; clinical decision-making; diagnostic errors; health care efficiency; language model; patient care

Source: DOI: 10.1002/hcs2.79 (PubMed)

Abstract:
Background: Given the strikingly high diagnostic error rate in hospitals, and the recent development of Large Language Models (LLMs), we set out to measure the diagnostic sensitivity of two popular LLMs: GPT-4 and PaLM2. Small-scale studies evaluating the diagnostic ability of LLMs have shown promising results, with GPT-4 demonstrating high accuracy in diagnosing test cases. However, larger evaluations on real electronic patient data are needed to provide more reliable estimates.
Methods: To fill this gap in the literature, we used a deidentified Electronic Health Record (EHR) data set of about 300,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston. This data set contained blood, imaging, microbiology, and vital sign information, as well as the patients' medical diagnostic codes. Based on the available EHR data, doctors curated a set of diagnoses for each patient, which we refer to as ground truth diagnoses. We then designed carefully written prompts to obtain patient diagnostic predictions from the LLMs and compared these to the ground truth diagnoses in a random sample of 1000 patients.
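The comparison described above can be sketched as a simple hit-rate metric: the proportion of ground-truth diagnoses that appear among the model's predictions. This is a minimal illustration only; the study's actual matching procedure (e.g., how predicted free-text diagnoses are matched to curated diagnoses) is not specified in the abstract, and all function and variable names below are hypothetical.

```python
def hit_rate(ground_truth: list[list[str]], predictions: list[list[str]]) -> float:
    """Fraction of ground-truth diagnoses matched by the model's predictions.

    Each element of ground_truth / predictions holds the diagnoses for one
    patient. A ground-truth diagnosis counts as a "hit" if it appears (after
    exact string match, in this sketch) in that patient's predicted set.
    """
    hits = 0
    total = 0
    for gt, pred in zip(ground_truth, predictions):
        total += len(gt)
        hits += len(set(gt) & set(pred))
    return hits / total if total else 0.0


# Toy example with two patients: 2 of 3 ground-truth diagnoses are matched.
gt = [["pneumonia", "sepsis"], ["anemia"]]
pred = [["pneumonia", "sepsis", "urinary tract infection"], ["gerd"]]
print(hit_rate(gt, pred))
```

In practice, matching model output to clinician-curated diagnoses would likely require normalization or clinician adjudication rather than exact string equality; this sketch only conveys the shape of the metric.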
Results: Based on the proportion of correctly predicted ground truth diagnoses, we estimated the diagnostic hit rate of GPT-4 to be 93.9%. PaLM2 achieved 84.7% on the same data set. On these 1000 randomly selected EHRs, GPT-4 correctly identified 1116 unique diagnoses.
Conclusions: The results suggest that artificial intelligence (AI), working alongside clinicians, has the potential to reduce the cognitive errors that lead to hundreds of thousands of misdiagnoses every year. However, human oversight of AI remains essential: LLMs cannot replace clinicians, especially when it comes to human understanding and empathy. Furthermore, significant challenges to incorporating AI into health care remain, including ethical, liability, and regulatory barriers.