Keywords: Bing; ChatGPT; Elicit; Perplexity; SciSpace; artificial intelligence (AI) chatbots; bibliographic verification; reference hallucination

Source: DOI 10.2196/54345, PDF (PubMed)

Abstract:
BACKGROUND: Artificial intelligence (AI) chatbots have recently come into use among health care practitioners in medical practice. Notably, the output of these AI chatbots was found to contain varying degrees of hallucination in both content and references. Such hallucinations raise doubts about their output and their implementation.
OBJECTIVE: The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots' citations.
METHODS: Six AI chatbots were challenged with the same 10 medical prompts, requesting 10 references per prompt. The RHS is composed of 6 bibliographic items and the reference's relevance to the prompt's keywords. The RHS was calculated for each reference, prompt, and type of prompt (basic vs complex). The average RHS was calculated for each AI chatbot and compared across the different types of prompts and AI chatbots.
RESULTS: Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), while Elicit and SciSpace generated the lowest RHS (score=1), and Perplexity generated a middle RHS (score=7). The highest degree of hallucination was observed for reference relevancy to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). ChatGPT and Bing had comparable RHS (β coefficient=-0.069; P=.32), while Perplexity had significantly lower RHS than ChatGPT (β coefficient=-0.345; P<.001). AI chatbots generally had significantly higher RHS when prompted with scenarios or complex format prompts (β coefficient=0.486; P<.001).
CONCLUSIONS: The variation in RHS underscores the necessity for a robust reference evaluation tool to improve the authenticity of AI chatbots. Further, the variations highlight the importance of verifying their output and citations. Elicit and SciSpace had negligible hallucination, while ChatGPT and Bing had critical hallucination levels. The proposed AI chatbots' RHS could contribute to ongoing efforts to enhance AI's general reliability in medical research.
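The abstract describes the RHS only at a high level (6 bibliographic items plus keyword relevance, averaged per reference, prompt, and chatbot) and does not give the per-item scoring scale. The sketch below is therefore only a rough illustration of how such a reference-level score and per-chatbot average might be computed: the item names, the binary 0/1 hallucination flags, and the function names (`reference_rhs`, `chatbot_rhs`) are assumptions for illustration, not the instrument published in the paper.

```python
from dataclasses import dataclass

# Hypothetical bibliographic items checked for each citation (assumed, not from the paper).
ITEMS = ["authors", "title", "journal", "year", "volume_pages", "doi"]

@dataclass
class ReferenceCheck:
    """Manual verification outcome for one citation returned by a chatbot."""
    hallucinated: dict          # item name -> True if the item could not be verified
    relevant_to_keywords: bool  # does the reference match the prompt's keywords?

def reference_rhs(check: ReferenceCheck) -> int:
    """Sum hallucination flags over the 6 bibliographic items plus keyword relevance.

    Assumed scale: 0 = fully verified, higher = more hallucinated components.
    """
    score = sum(1 for item in ITEMS if check.hallucinated.get(item, False))
    score += 0 if check.relevant_to_keywords else 1
    return score

def chatbot_rhs(checks: list[ReferenceCheck]) -> float:
    """Average the per-reference scores across all references a chatbot produced."""
    return sum(reference_rhs(c) for c in checks) / len(checks)

# Example: a citation with a fabricated DOI and a wrong year, but otherwise verifiable
# and on-topic, would score 2 under this assumed binary scheme.
example = ReferenceCheck(hallucinated={"doi": True, "year": True}, relevant_to_keywords=True)
print(reference_rhs(example))  # -> 2
```

Averaging at the prompt and prompt-type level (basic vs complex), as the study does, would simply group the `ReferenceCheck` records accordingly before calling the same averaging step.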