Keywords: Artificial Intelligence; Bias; ChatGPT; Peer review; Quality; Agreement

Source: DOI: 10.1016/j.cmpb.2024.108313

Abstract:
BACKGROUND: ChatGPT is an AI platform whose relevance in the peer review of scientific articles is steadily growing. Nonetheless, it has sparked debate over its potential biases and inaccuracies. This study aims to assess ChatGPT's ability to qualitatively emulate human reviewers in scientific research.
METHODS: We included the first submitted versions of the twenty most recent original research articles published in a high-profile medical journal as of 3 July 2023. Each article underwent evaluation by at least three human reviewers during the initial review stage. Subsequently, three researchers with medical backgrounds and expertise in manuscript revision independently and qualitatively assessed the agreement between the peer reviews generated by ChatGPT (version GPT-4) and the comments provided by human reviewers for these articles. The level of agreement was categorized as complete, partial, none, or contradictory.
RESULTS: 720 human reviewers' comments were assessed. Agreement among the three assessors was good (overall kappa > 0.6). ChatGPT's comments showed complete agreement in quality and substance with 48 (6.7 %) of the human reviewers' comments; partial agreement with 92 (12.8 %), identifying issues that required further elaboration or recommending supplementary steps to address concerns; no agreement with 565 (78.5 %); and contradicted 15 (2.1 %). ChatGPT's comments on methods had the lowest proportion of complete agreement (13 comments, 3.6 %), while general comments on the manuscript had the highest (17 comments, 22.1 %).
CONCLUSIONS: ChatGPT version GPT-4 has a limited ability to emulate human reviewers within the peer review process of scientific research.
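For readers curious about the "overall kappa > 0.6" reported for the three assessors, the sketch below shows one common way to compute multi-rater agreement over the study's four categories (complete, partial, none, contradictory). The abstract does not name the specific statistic used, so the choice of Fleiss' kappa, the function name, and the toy data are illustrative assumptions, not the authors' actual analysis.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa. counts[i, j] = number of raters assigning item i to category j.

    Assumes every item is rated by the same number of raters (3 assessors here).
    """
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item observed agreement: fraction of rater pairs that agree on the item.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                                # mean observed agreement
    p_j = counts.sum(axis=0) / (n_items * n_raters)   # marginal category proportions
    p_e = np.square(p_j).sum()                        # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical toy data: 5 reviewer comments, each rated by 3 assessors into
# one of 4 agreement categories (complete, partial, none, contradictory).
ratings = np.array([
    [3, 0, 0, 0],   # all three assessors chose "complete"
    [0, 2, 1, 0],
    [0, 0, 3, 0],
    [1, 1, 1, 0],   # maximal disagreement
    [0, 0, 2, 1],
])
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")
```

In the actual study the count matrix would have 720 rows, one per assessed human reviewer comment; values above roughly 0.6 are conventionally read as substantial agreement.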