OBJECTIVE: The objective of this study is to assess the reproducibility of structured abstracts generated by ChatGPT and Bard, compared with human-written abstracts, in the field of spine surgery.
METHODS: A total of 60 abstracts on spine-related topics were randomly selected from 7 reputable journals, and their paper titles were supplied to ChatGPT and Bard as input prompts to generate abstracts. The resulting 174 abstracts, comprising human-written, ChatGPT-generated, and Bard-generated abstracts, were evaluated for compliance with the structured format required by journal guidelines and for consistency of content. The likelihood of plagiarism and of AI authorship was assessed using the iThenticate and ZeroGPT programs, respectively. Eight reviewers in the spinal field evaluated 30 randomly extracted abstracts to determine whether each was produced by AI or by a human author.
RESULTS: The proportion of abstracts that met journal formatting guidelines was greater among ChatGPT-generated abstracts (34/60, 56.7%) than among Bard-generated abstracts (6/54, 11.1%; P<.001). However, a higher proportion of Bard abstracts (49/54, 90.7%) had word counts that met journal guidelines compared with ChatGPT abstracts (30/60, 50%; P<.001). The similarity index was significantly lower among ChatGPT-generated abstracts (20.7%) than among Bard-generated abstracts (32.1%; P<.001). The AI-detection program predicted that 21.7% (13/60) of the human-written group, 63.3% (38/60) of the ChatGPT group, and 87% (47/54) of the Bard group were possibly generated by AI, with an area under the curve value of 0.863 (P<.001). The mean detection rate by human reviewers was 53.8% (SD 11.2%), with a sensitivity of 56.3% and a specificity of 48.4%. A total of 56.3% (63/112) of the actual human-written abstracts and 48.4% (62/128) of the AI-generated abstracts were correctly recognized by the human reviewers as human-written and AI-generated, respectively.
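The reviewer-performance figures above can be checked directly from the reported counts. The sketch below recomputes the sensitivity and specificity, under the assumption that a correctly identified human-written abstract counts as a true positive and a correctly identified AI-generated abstract counts as a true negative (this framing is not stated explicitly in the abstract):

```python
# Arithmetic check of the human-reviewer performance reported above.
# Assumption: "human-written" is treated as the positive class, so
# sensitivity = correctly judged human / all human-written abstracts,
# specificity = correctly judged AI / all AI-generated abstracts.
human_total, human_correct = 112, 63  # human-written abstracts judged "human"
ai_total, ai_correct = 128, 62        # AI-generated abstracts judged "AI"

sensitivity = human_correct / human_total  # 0.5625, i.e., ~56.3%
specificity = ai_correct / ai_total        # 0.484375, i.e., ~48.4%

print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
```

To rounding, these reproduce the reported 56.3% sensitivity and 48.4% specificity, and their overall mean (53.8%) is consistent with reviewers performing near chance level.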
CONCLUSIONS: Both ChatGPT and Bard can be used to help write abstracts, but most AI-generated abstracts currently raise ethical concerns because of their high plagiarism and AI-detection rates. ChatGPT-generated abstracts appear to be superior to Bard-generated abstracts in meeting journal formatting guidelines. Because humans cannot accurately distinguish human-written abstracts from those produced by AI programs, special caution is warranted, and the ethical boundaries of using AI programs, including ChatGPT and Bard, must be examined.