Keywords: Bard; BingAI; ChatGPT; artificial intelligence; large language models; skeletal biology

MeSH: Humans; Artificial Intelligence; Bone and Bones / physiology; Bone Diseases / therapy

Source: DOI: 10.1093/jbmr/zjad007   PDF (PubMed)

Abstract:
Artificial intelligence (AI) chatbots utilizing large language models (LLMs) have recently garnered significant interest due to their ability to generate humanlike responses to user inquiries in an interactive dialog format. While these models are increasingly used by patients, scientific and medical providers, and trainees to obtain medical information and address biomedical questions, their performance may vary from field to field. The opportunities and risks these chatbots pose to the widespread understanding of skeletal health and science are unknown. Here we assess the performance of 3 high-profile LLM chatbots, Chat Generative Pre-Trained Transformer (ChatGPT) 4.0, BingAI, and Bard, in addressing 30 questions in 3 categories: basic and translational skeletal biology, clinical practitioner management of skeletal disorders, and patient queries, to assess the accuracy and quality of the responses. Thirty questions in each of these categories were posed, and responses were independently graded for their degree of accuracy by four reviewers. While each of the chatbots was often able to provide relevant information about skeletal disorders, the quality and relevance of these responses varied widely, and ChatGPT 4.0 had the highest overall median score in each of the categories. Each of these chatbots displayed distinct limitations, including inconsistent, incomplete, or irrelevant responses; inappropriate use of lay sources in a professional context; a failure to take patient demographics or clinical context into account when providing recommendations; and an inability to consistently identify areas of uncertainty in the relevant literature. Careful consideration of both the opportunities and risks of current AI chatbots is needed to formulate guidelines for best practices for their use as a source of information about skeletal health and biology.
Artificial intelligence chatbots are increasingly used as a source of information in health care and research settings due to their accessibility and ability to summarize complex topics in conversational language. However, it remains unclear whether they can provide accurate information for questions related to the medicine and biology of the skeleton. Here, we tested the performance of three prominent chatbots—ChatGPT, Bard, and BingAI—by tasking them with a series of prompts based on well-established skeletal biology concepts, realistic physician–patient scenarios, and potential patient questions. Despite their similarities in function, the three chatbot services differed in the accuracy of their responses. Chatbots performed well in some contexts, but in other cases clear limitations were observed, including inconsistent consideration of clinical context and patient demographics, occasional provision of incorrect or out-of-date information, and citation of inappropriate sources. With careful consideration of their current weaknesses, artificial intelligence chatbots offer the potential to transform education on skeletal health and science.