Keywords: AI; Bard; ChatGPT; abstract; artificial intelligence; chatbot; ethics; formatting guidelines; journal guidelines; language model; orthopedic surgery; plagiarism; scientific abstract; spine; spine surgery; surgery

MeSH: Humans; Spine / surgery; Abstracting and Indexing / standards, methods; Reproducibility of Results; Artificial Intelligence; Writing / standards

Source: DOI: 10.2196/52001

Abstract:
BACKGROUND: Due to recent advances in artificial intelligence (AI), language model applications can generate logical text output that is difficult to distinguish from human writing. ChatGPT (OpenAI) and Bard (subsequently rebranded as "Gemini"; Google AI) were developed using distinct approaches, but little has been studied about the differences in their capability to generate scientific abstracts. The use of AI to write scientific abstracts in the field of spine surgery is at the center of much debate and controversy.
OBJECTIVE: The objective of this study is to assess the reproducibility of the structured abstracts generated by ChatGPT and Bard compared to human-written abstracts in the field of spine surgery.
METHODS: In total, 60 abstracts dealing with spine topics were randomly selected from 7 reputable journals, and their paper titles were supplied to ChatGPT and Bard as input statements for generating abstracts. A total of 174 abstracts, divided into human-written, ChatGPT-generated, and Bard-generated abstracts, were evaluated for compliance with the structured format of the journal guidelines and for consistency of content. The likelihood of plagiarism and of AI output was assessed using the iThenticate and ZeroGPT programs, respectively. A total of 8 reviewers in the spinal field evaluated 30 randomly extracted abstracts to determine whether they were produced by AI or by human authors.
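The title-to-abstract generation step described above is straightforward to reproduce against any chat-completion API. The sketch below assumes the OpenAI Python SDK; the model version, prompt wording, and example title and journal are illustrative assumptions, since the abstract does not specify the exact prompts or models used.

```python
# Sketch: generate a structured abstract from a paper title alone.
# Assumptions (not stated in the abstract): model version, prompt text,
# and the example title/journal are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_abstract(title: str, journal: str) -> str:
    """Ask the model to draft a journal-style structured abstract."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are an academic writer in spine surgery."},
            {"role": "user",
             "content": (
                 f"Write a structured abstract (Background, Objective, "
                 f"Methods, Results, Conclusions) for a paper titled "
                 f"'{title}', following the formatting and word-count "
                 f"guidelines of {journal}."
             )},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage:
print(generate_abstract(
    "Minimally invasive versus open fusion for lumbar spinal stenosis",
    "The Spine Journal",
))
```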
RESULTS: The proportion of abstracts that met journal formatting guidelines was greater among ChatGPT-generated abstracts (34/60, 56.6%) than among Bard-generated abstracts (6/54, 11.1%; P<.001). However, a higher proportion of Bard abstracts (49/54, 90.7%) had word counts that met journal guidelines compared with ChatGPT abstracts (30/60, 50%; P<.001). The similarity index was significantly lower among ChatGPT-generated abstracts (20.7%) than among Bard-generated abstracts (32.1%; P<.001). The AI-detection program predicted that 21.7% (13/60) of the human-written group, 63.3% (38/60) of the ChatGPT group, and 87% (47/54) of the Bard group were possibly generated by AI, with an area under the curve value of 0.863 (P<.001). The mean detection rate by human reviewers was 53.8% (SD 11.2%), with a sensitivity of 56.3% and a specificity of 48.4%. A total of 56.3% (63/112) of the actual human-written abstracts and 48.4% (62/128) of the AI-generated abstracts were correctly recognized by human reviewers as human-written and AI-generated, respectively.
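For readers tracing these figures, the sensitivity and specificity follow directly from the reviewer counts reported above (8 reviewers × 30 abstracts = 240 judgments: 112 on human-written and 128 on AI-generated abstracts). A minimal check, treating "human-written" as the positive class:

```python
# Recompute the human-reviewer statistics from the counts reported above,
# treating "human-written" as the positive class.
human_total, human_correct = 112, 63  # human abstracts judged human-written
ai_total, ai_correct = 128, 62        # AI abstracts judged AI-generated

sensitivity = human_correct / human_total  # 63/112 = 0.5625
specificity = ai_correct / ai_total        # 62/128 = 0.484375

print(f"sensitivity = {sensitivity:.2%}")  # 56.25%, reported as 56.3%
print(f"specificity = {specificity:.2%}")  # 48.44%, reported as 48.4%
```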
CONCLUSIONS: Both ChatGPT and Bard can be used to help write abstracts, but most AI-generated abstracts are currently considered unethical due to high plagiarism and AI-detection rates. ChatGPT-generated abstracts appear to be superior to Bard-generated abstracts in meeting journal formatting guidelines. Because humans are unable to accurately distinguish abstracts written by humans from those produced by AI programs, it is crucial to exercise special caution and examine the ethical boundaries of using AI programs, including ChatGPT and Bard.