Keywords: Colitis; Immune Checkpoint Inhibitor; Immune-related adverse event (irAE); Pneumonitis; Thyroiditis

MeSH: Humans; Artificial Intelligence; Immunotherapy / methods, adverse effects; Neoplasms / drug therapy, immunology; Drug-Related Side Effects and Adverse Reactions

Source: DOI: 10.1136/jitc-2023-008599   PDF (PubMed)

Abstract:
BACKGROUND: Artificial intelligence (AI) chatbots have become a major source of general and medical information, though their accuracy and completeness are still being assessed. Their utility in answering questions about immune-related adverse events (irAEs), common and potentially dangerous toxicities of cancer immunotherapy, is not well defined.
METHODS: We developed 50 distinct questions, with answers available in published guidelines, covering 10 irAE categories, along with an additional 20 patient-specific scenarios, and queried two AI chatbots (ChatGPT and Bard). Experts in irAE management scored answers for accuracy and completeness using a Likert scale ranging from 1 (least accurate/complete) to 4 (most accurate/complete). Answers were compared across categories and across engines.
RESULTS: Overall, both engines scored highly for accuracy (mean scores for ChatGPT and Bard were 3.87 vs 3.5, p<0.01) and completeness (3.83 vs 3.46, p<0.01). Scores of 1-2 (completely or mostly inaccurate or incomplete) were particularly rare for ChatGPT (6/800 answer-ratings, 0.75%). Of the 50 questions, all eight physician raters gave ChatGPT a rating of 4 (fully accurate or complete) for 22 questions (for accuracy) and 16 questions (for completeness). In the 20 patient scenarios, the average accuracy score was 3.725 (median 4) and the average completeness score was 3.61 (median 4).
CONCLUSIONS: AI chatbots provided largely accurate and complete information regarding irAEs, and wildly inaccurate information ("hallucinations") was uncommon. However, until accuracy and completeness increase further, appropriate guidelines remain the gold standard to follow.