GPT-4

  • Article type: Journal Article
    BACKGROUND: Large language models (LLMs) have a high diagnostic accuracy when they evaluate previously published clinical cases.
    OBJECTIVE: We compared the accuracy of GPT-4's differential diagnoses for previously unpublished challenging case scenarios with its diagnostic accuracy for previously published cases.
    RESULTS: For a set of previously unpublished challenging clinical cases, GPT-4 achieved 61.1% correct in its top 6 diagnoses versus the previously reported 49.1% for physicians. For a set of 45 clinical vignettes of more common clinical scenarios, GPT-4 included the correct diagnosis in its top 3 diagnoses 100% of the time versus the previously reported 84.3% for physicians.
    CONCLUSIONS: GPT-4 performs at a level at least as good as, if not better than, that of experienced physicians on highly challenging cases in internal medicine. The extraordinary performance of GPT-4 on diagnosing common clinical scenarios could be explained in part by the fact that these cases were previously published and may have been included in the training dataset for this LLM.
  • Article type: Journal Article
    BACKGROUND: Given the strikingly high diagnostic error rate in hospitals, and the recent development of Large Language Models (LLMs), we set out to measure the diagnostic sensitivity of two popular LLMs: GPT-4 and PaLM2. Small-scale studies evaluating the diagnostic ability of LLMs have shown promising results, with GPT-4 demonstrating high accuracy in diagnosing test cases. However, larger evaluations on real electronic patient data are needed to provide more reliable estimates.
    METHODS: To fill this gap in the literature, we used a deidentified Electronic Health Record (EHR) data set of about 300,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston. This data set contained blood, imaging, microbiology, and vital sign information as well as the patients' medical diagnostic codes. Based on the available EHR data, doctors curated a set of diagnoses for each patient, which we refer to as ground truth diagnoses. We then designed carefully written prompts to obtain patient diagnostic predictions from the LLMs and compared them to the ground truth diagnoses in a random sample of 1000 patients.
    RESULTS: Based on the proportion of correctly predicted ground truth diagnoses, we estimated the diagnostic hit rate of GPT-4 to be 93.9%. PaLM2 achieved 84.7% on the same data set. On these 1000 randomly selected EHRs, GPT-4 correctly identified 1116 unique diagnoses.
    CONCLUSIONS: The results suggest that artificial intelligence (AI) has the potential, when working alongside clinicians, to reduce the cognitive errors that lead to hundreds of thousands of misdiagnoses every year. However, human oversight of AI remains essential: LLMs cannot replace clinicians, especially when it comes to human understanding and empathy. Furthermore, significant challenges to incorporating AI into health care exist, including ethical, liability, and regulatory barriers.
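    The hit-rate metric reported above (the proportion of ground-truth diagnoses that appear among the model's predictions, pooled over all sampled patients) can be sketched as follows. The data shapes and example values here are hypothetical illustrations, not the study's actual pipeline.

```python
# Sketch of a diagnostic hit-rate metric: the fraction of ground-truth
# diagnoses that the model's predictions contain. Patient IDs and
# diagnosis names below are made up for illustration.

def hit_rate(ground_truth, predictions):
    """ground_truth/predictions: dicts mapping patient id -> set of diagnoses."""
    total = sum(len(dx) for dx in ground_truth.values())
    hits = sum(
        len(ground_truth[pid] & predictions.get(pid, set()))
        for pid in ground_truth
    )
    return hits / total if total else 0.0

truth = {"p1": {"sepsis", "aki"}, "p2": {"pneumonia"}}
preds = {"p1": {"sepsis", "aki", "uti"}, "p2": {"chf"}}
print(hit_rate(truth, preds))  # 2 of 3 ground-truth diagnoses matched
```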
  • Article type: Journal Article
    OBJECTIVE: The Liver Imaging Reporting and Data System (LI-RADS) offers a standardized approach for imaging hepatocellular carcinoma. However, the diverse styles and structures of radiology reports complicate automatic data extraction. Large language models hold the potential for structured data extraction from free-text reports. Our objective was to evaluate the performance of Generative Pre-trained Transformer (GPT)-4 in extracting LI-RADS features and categories from free-text liver magnetic resonance imaging (MRI) reports.
    METHODS: Three radiologists generated 160 fictitious free-text liver MRI reports written in Korean and English, simulating real-world practice. Of these, 20 were used for prompt engineering, and 140 formed the internal test cohort. Seventy-two genuine reports, authored by 17 radiologists, were collected and de-identified for the external test cohort. LI-RADS features were extracted using GPT-4, with a Python script calculating categories. Accuracies in each test cohort were compared.
    RESULTS: On the external test, the accuracy for the extraction of major LI-RADS features, which encompass size, nonrim arterial phase hyperenhancement, nonperipheral 'washout', enhancing 'capsule' and threshold growth, ranged from .92 to .99. For the rest of the LI-RADS features, the accuracy ranged from .86 to .97. For the LI-RADS category, the model showed an accuracy of .85 (95% CI: .76, .93).
    CONCLUSIONS: GPT-4 shows promise in extracting LI-RADS features, yet further refinement of its prompting strategy and advancements in its neural network architecture are crucial for reliable use in processing complex real-world MRI reports.
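    As a rough illustration of the category-calculation step (the methods describe GPT-4 extracting the features and a separate Python script computing categories), here is a much-simplified reading of the LI-RADS v2018 diagnostic table. This is not the study's actual script, and the official ACR LI-RADS documentation should be consulted for the authoritative rules.

```python
# Simplified, illustrative rule-based LI-RADS v2018 category calculator.
# Inputs are the major features named in the abstract: observation size,
# nonrim arterial phase hyperenhancement (APHE), nonperipheral washout,
# enhancing capsule, and threshold growth. Not for clinical use.

def lirads_category(size_mm, aphe, washout, capsule, threshold_growth):
    extra = sum([washout, capsule, threshold_growth])  # additional major features
    if not aphe:
        if extra >= 2:
            return "LR-4"
        return "LR-4" if (extra == 1 and size_mm >= 20) else "LR-3"
    if size_mm < 10:
        return "LR-3" if extra == 0 else "LR-4"
    if size_mm < 20:
        if extra == 0:
            return "LR-3"
        # 10-19 mm with APHE: LR-5 requires washout or threshold growth;
        # enhancing capsule alone yields LR-4 in the v2018 table.
        return "LR-5" if (washout or threshold_growth) else "LR-4"
    return "LR-4" if extra == 0 else "LR-5"

print(lirads_category(25, aphe=True, washout=True, capsule=False,
                      threshold_growth=False))  # LR-5
```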
  • Article type: Journal Article
    OBJECTIVE: Effective clinical event classification is essential for clinical research and quality improvement. The validation of artificial intelligence (AI) models like Generative Pre-trained Transformer 4 (GPT-4) for this task and comparison with conventional methods remains unexplored.
    METHODS: We evaluated the performance of the GPT-4 model for classifying gastrointestinal (GI) bleeding episodes from 200 medical discharge summaries and compared the results with human review and an International Classification of Diseases (ICD) code-based system. The analysis included accuracy, sensitivity, and specificity evaluation, using ground truth determined by physician reviewers.
    RESULTS: GPT-4 exhibited an accuracy of 94.4% in identifying GI bleeding occurrences, outperforming ICD codes (accuracy 63.5%, P < 0.001). GPT-4's accuracy was either slightly lower or statistically similar to individual human reviewers (Reviewer 1: 98.5%, P < 0.001; Reviewer 2: 90.8%, P = 0.170). For location classification, GPT-4 achieved accuracies of 81.7% and 83.5% for confirmed and probable GI bleeding locations, respectively, with figures that were either slightly lower or comparable with those of human reviewers. GPT-4 was highly efficient, analyzing the dataset in 12.7 min at a cost of 21.2 USD, whereas human reviewers required 8-9 h each.
    CONCLUSIONS: Our study indicates GPT-4 offers a reliable, cost-efficient, and faster alternative to current clinical event classification methods, outperforming the conventional ICD coding system and performing comparably to individual expert human reviewers. Its implementation could facilitate more accurate and granular clinical research and quality audits. Future research should explore scalability, prompt and model tuning, and ethical implications of high-performance AI models in clinical data processing.
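    The accuracy, sensitivity, and specificity figures in this kind of event-classification evaluation come from a standard confusion matrix over binary labels (event present or absent per discharge summary). A minimal sketch with hypothetical labels, not the study's data:

```python
# Accuracy, sensitivity, and specificity from paired binary labels.
# `truth` is the physician-adjudicated ground truth; `pred` is the
# classifier output. The label lists below are illustrative only.

def confusion_metrics(truth, pred):
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    tn = sum(1 for t, p in zip(truth, pred) if not t and not p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

truth = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 1]
print(confusion_metrics(truth, pred))
```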
  • Article type: Journal Article
    ChatGPT is a new artificial intelligence-powered chatbot language model able to help otolaryngologists in clinical practice and research. We investigated the ability of ChatGPT-4 in the editing of manuscripts in otolaryngology. Four papers were written by a nonnative English-speaking otolaryngologist and edited by a professional editing service. ChatGPT-4 was used to detect and correct errors in the manuscripts. Of the 171 errors in the manuscripts, ChatGPT-4 detected 86 (50.3%), including vocabulary (N = 36), determiner (N = 27), preposition (N = 24), capitalization (N = 20), and number (N = 11) errors. ChatGPT-4 proposed appropriate corrections for 72 (83.7%) of the detected errors, while some error types were poorly detected (e.g., capitalization [5%] and vocabulary [44.4%] errors). In 82 cases, ChatGPT-4 claimed to change something that was already present in the edited text. ChatGPT demonstrated usefulness in identifying some types of errors, but not all. Nonnative English-speaking researchers should be aware of the current limits of ChatGPT-4 in the proofreading of manuscripts.