Keywords: AI; GPT-4; LLM; OpenAI; PHQ-9; Patient Health Questionnaire-9; artificial intelligence; clinical setting; clinician; clinicians; crisis; digital health; digital mental health; e-health; generative pretrained transformer 4; language model; large language model; machine learning; medication; mental disorder; mental health; patient information; psychiatrist; psychiatrists; psychiatry; psychologist; psychologists; self-reported; suicidal; suicidal ideation; suicide; suicide attempt; tele health; tele-mental health; telehealth; telemental health; treatment

MeSH: Humans; Telemedicine; Suicidal Ideation; Male; Female; Adult; Middle Aged; Artificial Intelligence; Suicide, Attempted / psychology; Mental Health; Teletherapy

Source: DOI:10.2196/58129   PDF (PubMed)

Abstract:
BACKGROUND: Due to recent advances in artificial intelligence, large language models (LLMs) have emerged as a powerful tool for a variety of language-related tasks, including sentiment analysis and the summarization of provider-patient interactions. However, there is limited research on these models in the area of crisis prediction.
OBJECTIVE: This study aimed to evaluate the performance of LLMs, specifically OpenAI's generative pretrained transformer 4 (GPT-4), in predicting current and future mental health crisis episodes using patient-provided information at intake among users of a national telemental health platform.
METHODS: Deidentified patient-provided data were pulled from specific intake questions of the Brightside telehealth platform, including the chief complaint, for 140 patients who indicated suicidal ideation (SI), and another 120 patients who later indicated SI with a plan during the course of treatment. Similar data were pulled for 200 randomly selected patients, treated during the same time period, who never endorsed SI. In total, 6 senior Brightside clinicians (3 psychologists and 3 psychiatrists) were shown patients' self-reported chief complaint and self-reported suicide attempt history but were blinded to the future course of treatment and other reported symptoms, including SI. They were asked a simple yes or no question regarding their prediction of endorsement of SI with plan, along with their confidence level about the prediction. GPT-4 was provided with similar information and asked to answer the same questions, enabling us to directly compare the performance of artificial intelligence and clinicians.
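The abstract does not reproduce the exact prompt given to GPT-4. Below is a minimal sketch of how such a yes/no prediction with a confidence rating could be requested, assuming the OpenAI Python SDK (v1.x); the prompt wording, function name, and input fields are illustrative assumptions, not the study's actual design.

```python
# Illustrative sketch only; not the study's prompt or pipeline.
# Assumes the OpenAI Python SDK v1.x; prompt wording and field names are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def predict_si_with_plan(chief_complaint: str, attempt_history: str | None = None) -> str:
    """Ask GPT-4 for a yes/no prediction of SI with plan plus a confidence level."""
    intake = f"Chief complaint: {chief_complaint}"
    if attempt_history is not None:
        intake += f"\nSelf-reported suicide attempt history: {attempt_history}"

    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for evaluation
        messages=[
            {
                "role": "system",
                "content": "You review deidentified intake responses from a telemental "
                           "health platform and answer screening questions.",
            },
            {
                "role": "user",
                "content": intake
                + "\n\nWill this patient endorse suicidal ideation with a plan? "
                  "Answer 'yes' or 'no', then state your confidence (low, medium, or high).",
            },
        ],
    )
    return response.choices[0].message.content
```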
RESULTS: Overall, the clinicians' average precision (0.7) was higher than that of GPT-4 (0.6) in identifying SI with plan at intake (n=140) versus no SI (n=200) when using the chief complaint alone, while sensitivity was higher for the GPT-4 (0.62) than the clinicians' average (0.53). The addition of suicide attempt history increased the clinicians' average sensitivity (0.59) and precision (0.77) while increasing the GPT-4 sensitivity (0.59) but decreasing the GPT-4 precision (0.54). Performance decreased comparatively when predicting future SI with plan (n=120) versus no SI (n=200) with a chief complaint only for the clinicians (average sensitivity=0.4; average precision=0.59) and the GPT-4 (sensitivity=0.46; precision=0.48). The addition of suicide attempt history increased performance comparatively for the clinicians (average sensitivity=0.46; average precision=0.69) and the GPT-4 (sensitivity=0.74; precision=0.48).
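For reference, the sensitivity and precision reported above are standard binary classification metrics over the yes/no predictions. The sketch below shows how they are computed from true positive, false negative, and false positive counts; the counts used are made up for illustration and are not the study's data.

```python
# Definitions of the reported metrics; the example counts are hypothetical, not study data.
def precision(tp: int, fp: int) -> float:
    """Share of 'yes' predictions that were truly SI with plan: TP / (TP + FP)."""
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Share of true SI-with-plan cases flagged 'yes' (recall): TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical counts: 60 cases correctly flagged, 40 missed, 20 false alarms.
tp, fn, fp = 60, 40, 20
print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # 0.60
print(f"precision   = {precision(tp, fp):.2f}")    # 0.75
```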
CONCLUSIONS: GPT-4, with a simple prompt design, produced results on some metrics that approached those of a trained clinician. Additional work must be done before such a model can be piloted in a clinical setting. The model should undergo safety checks for bias, given evidence that LLMs can perpetuate the biases of the underlying data on which they are trained. We believe that LLMs hold promise for augmenting the identification of higher-risk patients at intake and potentially delivering more timely care to patients.
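The abstract does not specify what form the bias safety checks would take. One common approach, sketched below under that assumption, is to compare sensitivity across patient subgroups and flag large gaps; the grouping variable, threshold, and data layout are illustrative, not taken from the study.

```python
# Hypothetical subgroup check: compare sensitivity across groups and flag large gaps.
from collections import defaultdict


def sensitivity_by_group(records, max_gap=0.10):
    """records: iterable of (group, y_true, y_pred) tuples with boolean labels/predictions."""
    tp, fn = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true:  # only true SI-with-plan cases contribute to sensitivity
            if y_pred:
                tp[group] += 1
            else:
                fn[group] += 1
    groups = set(tp) | set(fn)
    rates = {g: tp[g] / (tp[g] + fn[g]) for g in groups if (tp[g] + fn[g]) > 0}
    if rates and max(rates.values()) - min(rates.values()) > max_gap:
        print(f"Warning: sensitivity gap across groups exceeds {max_gap:.0%}: {rates}")
    return rates
```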