Keywords: Artificial Intelligence; Emergency Medical Services; Quality Assurance

Source: DOI: 10.1080/10903127.2024.2376757

Abstract:
Objectives: This study assesses the feasibility, inter-rater reliability, and accuracy of using OpenAI's ChatGPT-4 and Google's Gemini Ultra large language models (LLMs) for Emergency Medical Services (EMS) quality assurance. Implementing these LLMs for EMS quality assurance could significantly reduce the workload of medical directors and quality assurance staff by automating aspects of patient care report processing and review, offering more efficient and accurate identification of areas requiring improvement and thereby potentially enhancing patient care outcomes.
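The abstract does not describe how records were submitted to the models for review. As a rough illustration only, the sketch below shows how a single anonymized patient care report might be screened against one cardiac-care metric with the OpenAI Python client; the prompt wording, the report_text placeholder, and the YES/NO output format are hypothetical assumptions, not the authors' pipeline.

```python
# Illustrative sketch (not the authors' method): ask GPT-4 whether one
# anonymized patient care report documents adherence to a single metric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

report_text = "..."  # placeholder for one anonymized prehospital record

resp = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # deterministic output for repeatable QA scoring
    messages=[
        {
            "role": "system",
            "content": (
                "You are an EMS quality-assurance reviewer. Answer YES or NO: "
                "does this patient care report document that aspirin was "
                "administered to the chest-pain patient?"
            ),
        },
        {"role": "user", "content": report_text},
    ],
)

print(resp.choices[0].message.content)  # expected: "YES" or "NO"
```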
Methods: Two expert human reviewers, ChatGPT-4, and Gemini Ultra assessed and rated 150 consecutively sampled, anonymized prehospital records from two large urban EMS agencies for adherence to the 2020 National Association of State EMS Officials metrics for cardiac care. We evaluated scoring accuracy, inter-rater reliability, and review efficiency. Inter-rater reliability for the dichotomous outcome of each EMS metric was measured with the kappa statistic.
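For reference, Cohen's kappa corrects the observed agreement p_o for the agreement p_e expected by chance from each rater's marginal rates: kappa = (p_o - p_e) / (1 - p_e). Below is a minimal, self-contained sketch for dichotomous ratings; the ten example ratings are made up for illustration and are not study data.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters giving dichotomous (0/1) ratings."""
    n = len(rater_a)
    # Observed agreement: fraction of charts both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal rate of scoring "1".
    pa, pb = sum(rater_a) / n, sum(rater_b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings for ten charts (1 = metric met, 0 = not met):
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
model = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
print(f"kappa = {cohens_kappa(human, model):.3f}")
```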
Results: Human reviewers showed high inter-rater reliability, with 91.2% agreement and a kappa coefficient of 0.782 (0.654-0.910). ChatGPT-4 achieved substantial agreement with human reviewers on EKG documentation and aspirin administration (76.2% agreement, kappa coefficient 0.401 (0.334-0.468)), but performance varied across the other metrics. Gemini Ultra's evaluation was discontinued due to poor performance. Median review times did not differ significantly: 01:28 min (IQR 01:12-01:51) per human chart review, 01:24 min (IQR 01:09-01:53) per ChatGPT-4 review (p = 0.46), and 01:50 min (IQR 01:10-03:34) per Gemini Ultra review (p = 0.06).
Conclusions: Large language models show potential to support quality assurance by effectively and objectively extracting data elements, but their accuracy in interpreting non-standardized and time-sensitive details remains inferior to that of human evaluators. Our findings suggest that current LLMs may best serve as supplemental support to human review processes, though their present value remains limited. Improvements in LLM training and integration are recommended for more reliable performance in quality assurance processes.