Bing

  • Article type: Journal Article
    BACKGROUND: Artificial intelligence (AI) chatbots have recently gained use in medical practice by health care practitioners. Interestingly, the output of these AI chatbots was found to have varying degrees of hallucination in content and references. Such hallucinations generate doubts about their output and their implementation.
    OBJECTIVE: The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots' citations.
    METHODS: Six AI chatbots were challenged with the same 10 medical prompts, requesting 10 references per prompt. The RHS is composed of 6 bibliographic items and the reference's relevance to prompts' keywords. RHS was calculated for each reference, prompt, and type of prompt (basic vs complex). The average RHS was calculated for each AI chatbot and compared across the different types of prompts and AI chatbots.
    RESULTS: Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), while Elicit and SciSpace generated the lowest RHS (score=1), and Perplexity generated a middle RHS (score=7). The highest degree of hallucination was observed for reference relevancy to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). ChatGPT and Bing had comparable RHS (β coefficient=-0.069; P=.32), while Perplexity had significantly lower RHS than ChatGPT (β coefficient=-0.345; P<.001). AI chatbots generally had significantly higher RHS when prompted with scenarios or complex format prompts (β coefficient=0.486; P<.001).
    CONCLUSIONS: The variation in RHS underscores the necessity for a robust reference evaluation tool to improve the authenticity of AI chatbots. Further, the variations highlight the importance of verifying their output and citations. Elicit and SciSpace had negligible hallucination, while ChatGPT and Bing had critical hallucination levels. The proposed AI chatbots' RHS could contribute to ongoing efforts to enhance AI's general reliability in medical research.
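    The abstract does not spell out the scoring rubric, but as a rough illustration of how a per-reference hallucination score of this kind could be computed, the sketch below checks six bibliographic fields plus a keyword-relevance test against a verified record and counts mismatches. The Reference fields, the one-point-per-item scheme, and the hallucination_score helper are illustrative assumptions, not the study's actual rubric.

    ```python
    # Hypothetical sketch of a reference hallucination score (RHS) in the spirit
    # of the abstract above: 6 bibliographic items plus keyword relevance are
    # checked per reference, and mismatches accumulate into a per-reference score.
    # Field names and the 0/1-per-item scheme are illustrative assumptions.
    from dataclasses import dataclass, fields
    from typing import Optional

    @dataclass
    class Reference:
        title: str
        authors: str
        journal: str
        year: str
        volume_pages: str
        doi: str

    BIBLIO_FIELDS = [f.name for f in fields(Reference)]  # the 6 bibliographic items

    def hallucination_score(cited: Reference, verified: Optional[Reference],
                            prompt_keywords: set) -> int:
        """Return a hypothetical per-reference score: higher = more hallucinated."""
        if verified is None:                   # citation cannot be located at all
            return len(BIBLIO_FIELDS) + 1      # every item plus relevance counts
        score = 0
        for name in BIBLIO_FIELDS:             # compare each bibliographic item
            if getattr(cited, name).strip().lower() != getattr(verified, name).strip().lower():
                score += 1
        title_words = set(cited.title.lower().split())
        if not (prompt_keywords & title_words):  # crude keyword-relevance check
            score += 1
        return score

    # Example: a fabricated DOI and a wrong year add 2 points for this reference.
    cited = Reference("AI chatbots in medicine", "Doe J", "J Med AI", "2024", "12:1-9", "10.1000/fake")
    verified = Reference("AI chatbots in medicine", "Doe J", "J Med AI", "2023", "12:1-9", "10.1000/real")
    print(hallucination_score(cited, verified, {"chatbots", "hallucination"}))  # -> 2
    ```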

  • Article type: Journal Article
    This study explores disparities and opportunities in healthcare information provided by AI chatbots. We focused on recommendations for adjuvant therapy in endometrial cancer, analyzing responses across four regions (Indonesia, Nigeria, Taiwan, USA) and three platforms (Bard, Bing, ChatGPT-3.5). Utilizing previously published cases, we asked identical questions to chatbots from each location within a 24-h window. Responses were evaluated in a double-blinded manner on relevance, clarity, depth, focus, and coherence by ten experts in endometrial cancer. Our analysis revealed significant variations across different countries/regions (p < 0.001). Interestingly, Bing's responses in Nigeria consistently outperformed others (p < 0.05), excelling in all evaluation criteria (p < 0.001). Bard also performed better in Nigeria compared to other regions (p < 0.05), consistently surpassing them across all categories (p < 0.001, with relevance reaching p < 0.01). Notably, Bard's overall scores were significantly higher than those of ChatGPT-3.5 and Bing in all locations (p < 0.001). These findings highlight disparities and opportunities in the quality of AI-powered healthcare information based on user location and platform. This emphasizes the necessity for more research and development to guarantee equal access to trustworthy medical information through AI technologies.

  • Article type: Journal Article
    BACKGROUND: The evolution of artificial intelligence (AI) has significantly impacted various sectors, with health care witnessing some of its most groundbreaking contributions. Contemporary models, such as ChatGPT-4 and Microsoft Bing, have showcased capabilities beyond just generating text, aiding in complex tasks like literature searches and refining web-based queries.
    OBJECTIVE: This study explores a compelling query: can AI author an academic paper independently? Our assessment focuses on four core dimensions: relevance (to ensure that AI's response directly addresses the prompt), accuracy (to ascertain that AI's information is both factually correct and current), clarity (to examine AI's ability to present coherent and logical ideas), and tone and style (to evaluate whether AI can align with the formality expected in academic writings). Additionally, we will consider the ethical implications and practicality of integrating AI into academic writing.
    METHODS: To assess the capabilities of ChatGPT-4 and Microsoft Bing in the context of academic paper assistance in general practice, we used a systematic approach. ChatGPT-4, an advanced AI language model by OpenAI, excels in generating human-like text and adapting responses based on user interactions, though it has a knowledge cut-off in September 2021. Microsoft Bing's AI chatbot facilitates user navigation on the Bing search engine, offering tailored searches.
    RESULTS: In terms of relevance, ChatGPT-4 delved deeply into AI's health care role, citing academic sources and discussing diverse applications and concerns, while Microsoft Bing provided a concise, less detailed overview. In terms of accuracy, ChatGPT-4 correctly cited 72% (23/32) of its peer-reviewed articles but included some nonexistent references. Microsoft Bing's accuracy stood at 46% (6/13), supplemented by relevant non-peer-reviewed articles. In terms of clarity, both models conveyed clear, coherent text. ChatGPT-4 was particularly adept at detailing technical concepts, while Microsoft Bing was more general. In terms of tone, both models maintained an academic tone, but ChatGPT-4 exhibited superior depth and breadth in content delivery.
    CONCLUSIONS: Comparing ChatGPT-4 and Microsoft Bing for academic assistance revealed strengths and limitations. ChatGPT-4 excels in depth and relevance but falters in citation accuracy. Microsoft Bing is concise but lacks robust detail. Though both models have potential, neither can independently handle comprehensive academic tasks. As AI evolves, combining ChatGPT-4's depth with Microsoft Bing's up-to-date referencing could optimize academic support. Researchers should critically assess AI outputs to maintain academic credibility.

  • Article type: Editorial
    No abstract available.

  • Article type: Journal Article
    No abstract available.

  • Article type: Letter
    No abstract available.

  • Article type: Journal Article
    BACKGROUND: The purpose of this study was to evaluate three chatbots - OpenAI ChatGPT, Microsoft Bing Chat (currently Copilot), and Google Bard (currently Gemini) - in terms of their responses to a defined set of audiological questions.
    METHODS: Each chatbot was presented with the same 10 questions. The authors rated the responses on a Likert scale ranging from 1 to 5. Additional features, such as the number of inaccuracies or errors and the provision of references, were also examined.
    RESULTS: Most responses given by all three chatbots were rated as satisfactory or better. However, all chatbots generated at least a few errors or inaccuracies. ChatGPT achieved the highest overall score, while Bard was the worst. Bard was also the only chatbot unable to provide a response to one of the questions. ChatGPT was the only chatbot that did not provide information about its sources.
    CONCLUSIONS: Chatbots are an intriguing tool that can be used to access basic information in a specialized area like audiology. Nevertheless, one needs to be careful, as correct information is not infrequently mixed in with errors that are hard to pick up unless the user is well versed in the field.

  • Article type: Journal Article
    BACKGROUND: Artificial intelligence (AI) has the potential to transform preoperative planning for breast reconstruction by enhancing the efficiency, accuracy, and reliability of radiology reporting through automatic interpretation and perforator identification. Large language models (LLMs) have recently advanced significantly in medicine. This study aimed to evaluate the proficiency of contemporary LLMs in interpreting computed tomography angiography (CTA) scans for deep inferior epigastric perforator (DIEP) flap preoperative planning.
    METHODS: Four prominent LLMs, ChatGPT-4, BARD, Perplexity, and BingAI, answered six questions on CTA scan reporting. A panel of expert plastic surgeons with extensive experience in breast reconstruction assessed the responses using a Likert scale. In addition, the responses' readability was evaluated using the Flesch Reading Ease score, the Flesch-Kincaid Grade Level, and the Coleman-Liau Index. The DISCERN score was utilized to determine the responses' suitability. Statistical significance was identified through a t-test, and P-values < 0.05 were considered significant.
    RESULTS: BingAI provided the most accurate and useful responses to prompts, followed by Perplexity, ChatGPT, and then BARD. BingAI had the greatest Flesch Reading Ease (34.7±5.5) and DISCERN (60.5±3.9) scores. Perplexity had higher Flesch-Kincaid Grade Level (20.5±2.7) and Coleman-Liau Index (17.8±1.6) scores than the other LLMs.
    CONCLUSIONS: LLMs exhibit limitations in their capabilities of reporting CTA for preoperative planning of breast reconstruction, yet the rapid advancements in technology hint at a promising future. AI stands poised to enhance the education of CTA reporting and aid preoperative planning. In the future, AI technology could provide automatic CTA interpretation, enhancing the efficiency, accuracy, and reliability of CTA reports.
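    The readability instruments named above have standard published definitions, so a minimal self-contained sketch of them follows; the regex-based sentence splitter and vowel-group syllable counter are rough assumptions, and the abstract does not say what tooling the study itself used.

    ```python
    # Minimal sketch of the readability formulas named in the abstract above
    # (Flesch Reading Ease, Flesch-Kincaid Grade Level, Coleman-Liau Index),
    # computed from their standard published definitions with naive text counts.
    import re

    def _counts(text: str):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        letters = sum(len(re.findall(r"[A-Za-z]", w)) for w in words)
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
        return sentences, max(1, len(words)), letters, syllables

    def flesch_reading_ease(text: str) -> float:
        s, w, _, syl = _counts(text)
        return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)

    def flesch_kincaid_grade(text: str) -> float:
        s, w, _, syl = _counts(text)
        return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59

    def coleman_liau_index(text: str) -> float:
        s, w, letters, _ = _counts(text)
        return 0.0588 * (letters / w * 100) - 0.296 * (s / w * 100) - 15.8

    sample = "The deep inferior epigastric perforator flap is assessed with CT angiography."
    print(round(flesch_reading_ease(sample), 1), round(flesch_kincaid_grade(sample), 1),
          round(coleman_liau_index(sample), 1))
    ```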

  • Article type: Journal Article
    Introduction: With the potential for artificial intelligence (AI) chatbots to serve as the primary source of glaucoma information to patients, it is essential to characterize the information that chatbots provide such that providers can tailor discussions, anticipate patient concerns, and identify misleading information. Therefore, the purpose of this study was to evaluate glaucoma information from AI chatbots, including ChatGPT-4, Bard, and Bing, by analyzing response accuracy, comprehensiveness, readability, word count, and character count in comparison to each other and glaucoma-related American Academy of Ophthalmology (AAO) patient materials.
    Methods: Section headers from AAO glaucoma-related patient education brochures were adapted into question form and asked five times to each AI chatbot (ChatGPT-4, Bard, and Bing). Two sets of responses from each chatbot were used to evaluate the accuracy of AI chatbot responses and AAO brochure information, and the comprehensiveness of AI chatbot responses compared to the AAO brochure information, scored 1-5 by three independent glaucoma-trained ophthalmologists. Readability (assessed with Flesch-Kincaid Grade Level (FKGL), corresponding to the United States school grade levels), word count, and character count were determined for all chatbot responses and AAO brochure sections.
    Results: Accuracy scores for AAO, ChatGPT, Bing, and Bard were 4.84, 4.26, 4.53, and 3.53, respectively. On direct comparison, AAO was more accurate than ChatGPT (p=0.002), and Bard was the least accurate (Bard versus AAO, p<0.001; Bard versus ChatGPT, p<0.002; Bard versus Bing, p=0.001). ChatGPT had the most comprehensive responses (ChatGPT versus Bing, p<0.001; ChatGPT versus Bard, p=0.008), with comprehensiveness scores for ChatGPT, Bing, and Bard at 3.32, 2.16, and 2.79, respectively. AAO information and Bard responses were at the most accessible readability levels (AAO versus ChatGPT, AAO versus Bing, Bard versus ChatGPT, Bard versus Bing, all p<0.0001), with readability levels for AAO, ChatGPT, Bing, and Bard at 8.11, 13.01, 11.73, and 7.90, respectively. Bing responses had the lowest word and character count.
    Conclusion: AI chatbot responses varied in accuracy, comprehensiveness, and readability. With accuracy scores and comprehensiveness below that of AAO brochures and elevated readability levels, AI chatbots require improvements to be a more useful supplementary source of glaucoma information for patients. Physicians must be aware of these limitations such that patients are asked about existing knowledge and questions and are then provided with clarifying and comprehensive information.

  • Article type: Journal Article
    The aim of the study is to evaluate and compare the quality and readability of responses generated by five different artificial intelligence (AI) chatbots - ChatGPT, Bard, Bing, Ernie, and Copilot - to the top searched queries of erectile dysfunction (ED). Google Trends was used to identify ED-related relevant phrases. Each AI chatbot received a specific sequence of 25 frequently searched terms as input. Responses were evaluated using DISCERN, Ensuring Quality Information for Patients (EQIP), and Flesch-Kincaid Grade Level (FKGL) and Reading Ease (FKRE) metrics. The top three most frequently searched phrases were "erectile dysfunction cause," "how to erectile dysfunction," and "erectile dysfunction treatment." Zimbabwe, Zambia, and Ghana exhibited the highest level of interest in ED. None of the AI chatbots achieved the necessary degree of readability. However, Bard exhibited significantly higher FKRE and FKGL ratings (p = 0.001), and Copilot achieved better EQIP and DISCERN ratings than the other chatbots (p = 0.001). Bard exhibited the simplest linguistic framework and posed the least challenge in terms of readability and comprehension, and Copilot's text quality on ED was superior to the other chatbots. As new chatbots are introduced, their understandability and text quality increase, providing better guidance to patients.
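    As context for the DISCERN ratings reported above, the sketch below shows one way per-item DISCERN scores could be aggregated per chatbot. The 16-question, 1-5 structure is the standard DISCERN layout, but the example scores, the rater averaging, and the discern_total helper are illustrative assumptions rather than the study's actual analysis.

    ```python
    # Hedged sketch of how DISCERN-style ratings could be aggregated per chatbot.
    # DISCERN is a 16-question instrument for written health information, each
    # question scored 1-5 (maximum total 80); the scores below are hypothetical.
    from statistics import mean

    def discern_total(item_scores):
        """Sum one rater's 16 item scores (each expected to be 1-5)."""
        assert len(item_scores) == 16 and all(1 <= s <= 5 for s in item_scores)
        return sum(item_scores)

    # Hypothetical ratings: {chatbot: [one 16-item score list per rater]}
    ratings = {
        "Copilot": [[4] * 16, [5] * 16],
        "Bard":    [[3] * 16, [4] * 16],
    }

    for bot, raters in ratings.items():
        totals = [discern_total(r) for r in raters]
        print(f"{bot}: mean DISCERN total = {mean(totals):.1f} / 80")
    ```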
