Background: ChatGPT, a publicly available artificial intelligence large language model, makes sophisticated artificial intelligence technology available on demand. Indeed, the use of ChatGPT has already begun to make its way into medical research. However, the medical community has yet to understand the capabilities and ethical considerations of artificial intelligence within this context, and unknowns remain regarding ChatGPT's writing abilities, accuracy, and implications for authorship.
Objective: We hypothesized that human reviewers and artificial intelligence detection software differ in their ability to correctly identify original published abstracts and artificial intelligence-written abstracts in the fields of Gynecology and Urogynecology. We also suspected that concrete differences in writing errors, readability, and perceived writing quality exist between original and artificial intelligence-generated text.
Methods: Twenty-five articles published in high-impact medical journals and a collection of Gynecology and Urogynecology journals were selected. ChatGPT was prompted to write 25 corresponding artificial intelligence-generated abstracts, given each abstract's title, the journal-dictated abstract requirements, and select original results. The original and artificial intelligence-generated abstracts were reviewed by blinded Gynecology and Urogynecology faculty and fellows, who classified each as original or artificial intelligence-generated. All abstracts were analyzed by the publicly available artificial intelligence detection tools GPTZero, Originality, and Copyleaks, and were assessed for writing errors and quality by the artificial intelligence writing assistant Grammarly.
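To make the generation step concrete, the sketch below shows how such prompting could be scripted. It is an illustration under stated assumptions, not the study's actual procedure: the abstract does not say whether the ChatGPT web interface or the API was used, the model version is not named, and the prompt wording and the helper generate_abstract are hypothetical.

```python
# Hypothetical sketch of the prompting step: for each source article, supply
# the title, the journal's abstract requirements, and selected original
# results, and ask the model to draft a corresponding abstract.
# Assumes the OpenAI Python SDK (>=1.0) with OPENAI_API_KEY set in the
# environment; the study itself may have used the ChatGPT web interface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_abstract(title: str, requirements: str, results: str) -> str:
    """Draft an abstract from the same three inputs the study provided."""
    prompt = (
        f"Write a structured scientific abstract titled '{title}'.\n"
        f"Follow these journal abstract requirements exactly:\n{requirements}\n"
        f"Incorporate these results:\n{results}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: the abstract does not name the model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```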
Results: A total of 157 reviews of 25 original and 25 artificial intelligence-generated abstracts were conducted by 26 faculty and 4 fellows; 57% of original abstracts and 42.3% of artificial intelligence-generated abstracts were correctly identified, yielding an average accuracy of 49.7% across all abstracts. All 3 artificial intelligence detectors rated the original abstracts as less likely to be artificial intelligence-written than the ChatGPT-generated abstracts (GPTZero, 5.8% vs 73.3%; P<.001; Originality, 10.9% vs 98.1%; P<.001; Copyleaks, 18.6% vs 58.2%; P<.001). The performance of the 3 artificial intelligence detection tools differed when analyzing all abstracts (P=.03), original abstracts (P<.001), and artificial intelligence-generated abstracts (P<.001). Grammarly text analysis identified more writing issues and correctness errors in original than in artificial intelligence-generated abstracts, including a lower Grammarly score reflecting poorer writing quality (82.3 vs 88.1; P=.006), more total writing issues (19.2 vs 12.8; P<.001), critical issues (5.4 vs 1.3; P<.001), confusing words (0.8 vs 0.1; P=.006), misspelled words (1.7 vs 0.6; P=.02), incorrect determiner use (1.2 vs 0.2; P=.002), and comma misuse (0.3 vs 0.0; P=.005).
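As a quick sanity check on the headline number, the snippet below shows how the pooled 49.7% figure follows from the two group accuracies, assuming the 157 reviews were split roughly evenly between original and artificial intelligence-generated abstracts (the exact split is not reported here).

```python
# Sanity check: pooled reviewer accuracy from the two reported group rates.
# Assumption: reviews were split about evenly between the two abstract types,
# so the overall rate is approximately the simple mean of the group rates;
# the published 49.7% may reflect the exact (unreported) per-group counts.
acc_original = 0.570  # reported accuracy on original abstracts
acc_ai = 0.423        # reported accuracy on AI-generated abstracts
overall = (acc_original + acc_ai) / 2
print(f"{overall:.4f}")  # 0.4965, i.e., about 49.7%
```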
Conclusion: Human reviewers are unable to detect the subtle differences between human- and ChatGPT-generated scientific writing because of artificial intelligence's ability to generate remarkably realistic text. Artificial intelligence detection software improves the identification of artificial intelligence-generated writing but still lacks complete accuracy and requires programmatic improvements to achieve optimal detection. Given that reviewers and editors may be unable to reliably detect artificial intelligence-generated text, clear guidelines for reporting authors' use of artificial intelligence and for implementing artificial intelligence detection software in the review process will need to be established as artificial intelligence chatbots gain more widespread use.