BACKGROUND: Mentalization, which is integral to human cognitive processes, pertains to the interpretation of one's own and others' mental states, including emotions, beliefs, and intentions. With the advent of artificial intelligence (AI) and the prominence of large language models in mental health applications, questions persist about their aptitude in emotional comprehension. The prior iteration of the large language model from OpenAI, ChatGPT-3.5, demonstrated an advanced capacity to interpret emotions from textual data, surpassing human benchmarks. Given the introduction of ChatGPT-4, with its enhanced visual processing capabilities, and considering Google Bard's existing visual functionalities, a rigorous assessment of their proficiency in visual mentalizing is warranted.
OBJECTIVE: The aim of the research was to critically evaluate the capabilities of ChatGPT-4 and Google Bard with regard to their competence in discerning visual mentalizing indicators as contrasted with their textual-based mentalizing abilities.
METHODS: The Reading the Mind in the Eyes Test developed by Baron-Cohen and colleagues was used to assess the models' proficiency in interpreting visual emotional indicators. Simultaneously, the Levels of Emotional Awareness Scale was used to evaluate the large language models' aptitude in textual mentalizing. Collating data from both tests provided a holistic view of the mentalizing capabilities of ChatGPT-4 and Bard.
RESULTS: ChatGPT-4, displaying a pronounced ability in emotion recognition, secured scores of 26 and 27 in 2 distinct evaluations, significantly deviating from a random response paradigm (P<.001). These scores align with established benchmarks from the broader human demographic. Notably, ChatGPT-4 exhibited consistent responses, with no discernible biases pertaining to the sex of the model or the nature of the emotion. In contrast, Google Bard's performance aligned with random response patterns, securing scores of 10 and 12 and rendering further detailed analysis redundant. In the domain of textual analysis, both ChatGPT and Bard surpassed established benchmarks from the general population, with their performances being remarkably congruent.
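The significance claim above can be illustrated with a minimal binomial-tail calculation, assuming the standard 36-item, 4-option format of the Reading the Mind in the Eyes Test (the item count and chance level are assumptions for illustration; they are not stated in this abstract):

```python
from math import comb

def binom_tail(score: int, n: int = 36, p: float = 0.25) -> float:
    """P(X >= score) for X ~ Binomial(n, p): the probability of scoring at
    least `score` on an n-item test by guessing among 1/p equally likely
    options per item."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(score, n + 1))

# Under these assumptions, chance performance averages 9/36. Scores of 26
# and 27 have a vanishingly small probability of arising from guessing,
# while scores of 10 and 12 are consistent with a random response pattern.
print(binom_tail(26))  # far below .001
print(binom_tail(10))  # well above conventional significance thresholds
```

This is only a sketch of the kind of test that supports the reported P<.001; the study itself may have used a different statistical procedure.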
CONCLUSIONS: ChatGPT-4 proved its efficacy in the domain of visual mentalizing, aligning closely with human performance standards. Although both models displayed commendable acumen in textual emotion interpretation, Bard's capabilities in visual emotion interpretation necessitate further scrutiny and potential refinement. This study stresses the criticality of ethical AI development for emotional recognition, highlighting the need for inclusive data, collaboration with patients and mental health experts, and stringent governmental oversight to ensure transparency and protect patient privacy.