Background: Artificial intelligence models tailored to diagnose cognitive impairment have shown excellent results. However, it is unclear whether large language models can rival specialized models through text alone.
Objective: In this study, we explored the performance of ChatGPT in the primary screening of mild cognitive impairment (MCI) and standardized the design steps and components of the prompt.
Methods: We gathered a total of 174 participants from the DementiaBank screening and assigned 70% of them to the training set and 30% to the test set. Only text dialogues were kept. Sentences were cleaned using a macro code, followed by a manual check. The prompt consisted of 5 main parts: character setting, scoring system setting, indicator setting, output setting, and explanatory information setting. Three dimensions of variables from published studies were included: vocabulary (ie, word frequency and word ratio, phrase frequency and phrase ratio, and lexical complexity), syntax and grammar (ie, syntactic complexity and grammatical components), and semantics (ie, semantic density and semantic coherence). We used R 4.3.0 for the analysis of variables and diagnostic indicators.
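The five prompt components described above can be sketched as a simple template assembly. The component texts below are illustrative placeholders, not the study's actual prompt wording, and the function name is hypothetical.

```python
# Sketch of assembling the 5-part screening prompt described in Methods.
# All component texts are hypothetical placeholders, not the study's wording.

PROMPT_PARTS = {
    "character_setting": "You are a clinician screening transcripts for mild cognitive impairment (MCI).",
    "scoring_system_setting": "Rate each indicator from 0 (absent) to 2 (pronounced).",
    "indicator_setting": (
        "Assess: vocabulary (word/phrase frequency and ratio, lexical complexity), "
        "syntax and grammar (syntactic complexity, grammatical components), "
        "and semantics (semantic density, semantic coherence)."
    ),
    "output_setting": "Return a JSON object with one score per indicator and an overall label.",
    "explanatory_information_setting": "Background notes on MCI and the interview task go here.",
}

def build_prompt(transcript: str) -> str:
    """Concatenate the five components, then append the cleaned dialogue text."""
    header = "\n\n".join(PROMPT_PARTS.values())
    return f"{header}\n\nTranscript:\n{transcript}"

print(build_prompt("INV: how are you today? PAR: fine, thank you."))
```

Keeping each component as a named entry makes it easy to standardize and vary one part at a time (eg, swapping the indicator setting) without touching the rest of the prompt.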
Results: Three additional indicators related to the severity of MCI were incorporated into the model's final prompt. These indicators effectively discriminated between participants with MCI and cognitively normal participants: the tip-of-the-tongue phenomenon (P<.001), difficulty with complex ideas (P<.001), and memory issues (P<.001). The final GPT-4 model achieved a sensitivity of 0.8636, a specificity of 0.9487, and an area under the curve of 0.9062 on the training set; on the test set, the sensitivity, specificity, and area under the curve reached 0.7727, 0.8333, and 0.8030, respectively.
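For a screen with a single binary cut-off like this, sensitivity, specificity, and the area under the curve follow directly from the confusion matrix; with only one operating point on the ROC curve, the AUC reduces to the mean of sensitivity and specificity. A minimal sketch, using hypothetical counts chosen only to illustrate the calculation (not the study's published tabulation):

```python
# Sensitivity, specificity, and single-threshold AUC from a confusion matrix.
# The example counts are hypothetical, chosen only to illustrate the arithmetic.

def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    sensitivity = tp / (tp + fn)  # true positive rate among MCI cases
    specificity = tn / (tn + fp)  # true negative rate among normal controls
    # With a single binary cut-off, the ROC curve has one operating point,
    # so the AUC reduces to (sensitivity + specificity) / 2.
    auc = (sensitivity + specificity) / 2
    return {
        "sensitivity": round(sensitivity, 4),
        "specificity": round(specificity, 4),
        "auc": round(auc, 4),
    }

# Hypothetical split: 22 MCI cases (17 flagged), 30 controls (25 cleared)
# -> sensitivity 0.7727, specificity 0.8333, AUC 0.8030
print(screening_metrics(tp=17, fn=5, tn=25, fp=5))
```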
Conclusions: ChatGPT was effective in the primary screening of participants with possible MCI. Further standardization of prompts by clinicians would also improve the model's performance. It is important to note that ChatGPT is not a substitute for a clinician making a diagnosis.