publicly available

  • Article type: Journal Article
    BACKGROUND: Medical documentation plays a crucial role in clinical practice, facilitating accurate patient management and communication among health care professionals. However, inaccuracies in medical notes can lead to miscommunication and diagnostic errors. Additionally, the demands of documentation contribute to physician burnout. Although intermediaries like medical scribes and speech recognition software have been used to ease this burden, they have limitations in terms of accuracy and addressing provider-specific metrics. The integration of ambient artificial intelligence (AI)-powered solutions offers a promising way to improve documentation while fitting seamlessly into existing workflows.
    OBJECTIVE: This study aims to assess the accuracy and quality of Subjective, Objective, Assessment, and Plan (SOAP) notes generated by ChatGPT-4, an AI model, using established transcripts of History and Physical Examination as the gold standard. We seek to identify potential errors and evaluate the model's performance across different categories.
    METHODS: We conducted simulated patient-provider encounters representing various ambulatory specialties and transcribed the audio files. Key reportable elements were identified, and ChatGPT-4 was used to generate SOAP notes based on these transcripts. Three versions of each note were created and compared to the gold standard via chart review; errors generated from the comparison were categorized as omissions, incorrect information, or additions. We compared the accuracy of data elements across versions, transcript length, and data categories. Additionally, we assessed note quality using the Physician Documentation Quality Instrument (PDQI) scoring system.
    RESULTS: Although ChatGPT-4 consistently generated SOAP-style notes, there were, on average, 23.6 errors per clinical case, with errors of omission (86%) being the most common, followed by addition errors (10.5%) and inclusion of incorrect facts (3.2%). There was significant variance between replicates of the same case, with only 52.9% of data elements reported correctly across all 3 replicates. The accuracy of data elements varied across cases, with the highest accuracy observed in the "Objective" section. Consequently, the measure of note quality, assessed by PDQI, demonstrated intra- and intercase variance. Finally, the accuracy of ChatGPT-4 was inversely correlated to both the transcript length (P=.05) and the number of scorable data elements (P=.05).
    CONCLUSIONS: Our study reveals substantial variability in errors, accuracy, and note quality generated by ChatGPT-4. Errors were not limited to specific sections, and the inconsistency in error types across replicates complicated predictability. Transcript length and data complexity were inversely correlated with note accuracy, raising concerns about the model's effectiveness in handling complex medical cases. The quality and reliability of clinical notes produced by ChatGPT-4 do not meet the standards required for clinical use. Although AI holds promise in health care, caution should be exercised before widespread adoption. Further research is needed to address accuracy, variability, and potential errors. ChatGPT-4, while valuable in various applications, should not be considered a safe alternative to human-generated clinical documentation at this time.
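
    The Methods and Results of this entry describe chart-review tallies of omissions, incorrect facts, and additions, plus a correlation between transcript length and note accuracy. The sketch below is a minimal, hypothetical illustration of that kind of tally, not the authors' code; the case IDs, counts, and word lengths are placeholder values.

```python
# Hypothetical sketch of the error tally described in the abstract; not the study's code.
from scipy.stats import spearmanr

# case_id -> chart-review counts for one replicate (placeholder values)
review = {
    "case_01": {"correct": 40, "omission": 22, "incorrect": 1, "addition": 3, "words": 1850},
    "case_02": {"correct": 55, "omission": 10, "incorrect": 0, "addition": 1, "words": 900},
    "case_03": {"correct": 30, "omission": 28, "incorrect": 2, "addition": 4, "words": 2400},
    "case_04": {"correct": 48, "omission": 15, "incorrect": 1, "addition": 2, "words": 1200},
}

lengths, accuracies = [], []
for case_id, r in review.items():
    # Additions are extra facts in the note, so they count as errors but not as scorable elements.
    scorable = r["correct"] + r["omission"] + r["incorrect"]
    errors = r["omission"] + r["incorrect"] + r["addition"]
    accuracy = r["correct"] / scorable
    lengths.append(r["words"])
    accuracies.append(accuracy)
    print(f"{case_id}: {errors} errors, accuracy {accuracy:.1%}")

# A negative rho would mirror the inverse correlation reported in the Results.
rho, p = spearmanr(lengths, accuracies)
print(f"Spearman rho={rho:.2f}, P={p:.3f}")
```

    A rank correlation is used here only because it is a common choice for small samples; the abstract does not state which statistical test the authors applied.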
  • Article type: Letter
    No abstract available.
  • Article type: Systematic Review
    BACKGROUND: Machine learning (ML) may improve clinical decision-making in critical care settings, but intrinsic biases in datasets can introduce bias into predictive models. This study aims to determine if publicly available critical care datasets provide relevant information to identify historically marginalized populations.
    METHODS: We conducted a review to identify the manuscripts that report the training/validation of ML algorithms using publicly accessible critical care electronic medical record (EMR) datasets. The datasets were reviewed to determine if the following 12 variables were available: age, sex, gender identity, race and/or ethnicity, self-identification as an indigenous person, payor, primary language, religion, place of residence, education, occupation, and income.
    RESULTS: Seven publicly available databases were identified. Medical Information Mart for Intensive Care (MIMIC) reports information on 7 of the 12 variables of interest, Sistema de Informação de Vigilância Epidemiológica da Gripe (SIVEP-Gripe) on 7, COVID-19 Mexican Open Repository on 4, and eICU on 4. Other datasets report information on 2 or fewer variables. All 7 databases included information about sex and age. Four databases (57%) included information about whether a patient identified as native or indigenous. Only 3 (43%) included data about race and/or ethnicity. Two databases (29%) included information about residence, and one (14%) included information about payor, language, and religion. One database (14%) included information about education and patient occupation. No databases included information on gender identity and income.
    CONCLUSIONS: This review demonstrates that publicly available critical care data used to train AI algorithms do not include enough information to properly look for intrinsic bias and fairness issues towards historically marginalized populations.
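
    The review's method is essentially an availability audit: for each public ICU dataset, record which of the 12 demographic and social variables it reports, then summarize coverage. A minimal sketch of such an audit follows; the per-dataset variable sets are illustrative placeholders, not the review's extraction results.

```python
# Hypothetical availability audit in the style the review describes; not its actual extraction code.
VARIABLES = [
    "age", "sex", "gender identity", "race/ethnicity", "indigenous self-identification",
    "payor", "primary language", "religion", "place of residence",
    "education", "occupation", "income",
]

# dataset -> variables it reports (placeholder sets; consult the paper and dataset documentation)
availability = {
    "MIMIC": {"age", "sex", "race/ethnicity", "payor", "primary language", "religion", "place of residence"},
    "SIVEP-Gripe": {"age", "sex", "race/ethnicity", "indigenous self-identification",
                    "place of residence", "education", "occupation"},
    "eICU": {"age", "sex", "race/ethnicity", "indigenous self-identification"},
}

for name, present in availability.items():
    missing = [v for v in VARIABLES if v not in present]
    print(f"{name}: {len(present)}/{len(VARIABLES)} variables; missing: {', '.join(missing)}")

# Share of audited datasets reporting each variable
for v in VARIABLES:
    share = sum(v in present for present in availability.values()) / len(availability)
    print(f"{v}: reported by {share:.0%} of datasets")
```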