Keywords: ChatGPT; Ensemble learning; Plagiarism; Stylometry; Writing style

Source: DOI:10.1016/j.heliyon.2024.e32976

Abstract:
The use of AI-generated text has surged since the advent of large language models. Although AI text generators such as ChatGPT are beneficial, they also threaten academic standards because students may resort to them. In this work, we propose a technique that leverages the intrinsic stylometric features of documents to detect ChatGPT-based plagiarism. The stylometric features were normalized and fed to classical classifiers, such as k-Nearest Neighbors, Decision Tree, and Naïve Bayes, as well as ensemble classifiers, such as XGBoost and Stacking. The classifiers were examined thoroughly using cross-validation, hyperparameter tuning, and multiple training iterations. The results show that both classical and ensemble learning classifiers are effective at distinguishing between human and ChatGPT writing styles; XGBoost performed notably well, achieving 100% accuracy, recall, and precision. Moreover, the proposed XGBoost classifier outperformed the state-of-the-art result obtained with the same classifier on the same dataset, highlighting the superiority of the proposed style feature extraction method over TF-IDF techniques. The ensemble learning classifiers were also applied to a generated dataset of mixed texts, in which paragraphs are written by both ChatGPT and humans. The results show that 98% of the documents were classified correctly as either mixed or human. The last contribution is the authorship attribution of the paragraphs of a single document, where the accuracy reached 92.3%.
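
The pipeline outlined in the abstract (extract stylometric features, normalize them, feed them to a classifier) can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration: the abstract does not list the actual features, so the four below (mean sentence length, mean word length, type-token ratio, comma rate) are generic stand-ins; min-max scaling is assumed because the abstract says only that features were "normalized"; and the tiny k-Nearest Neighbors routine is a hand-written stand-in for one of the classical classifiers named, not the authors' implementation.

```python
import math
import re
from collections import Counter


def stylometric_features(text):
    """Return a small, illustrative stylometric vector for one document.

    These four features are hypothetical examples, not the paper's feature set.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)  # guard against empty documents
    return [
        len(words) / max(len(sentences), 1),    # mean sentence length (words)
        sum(len(w) for w in words) / n_words,   # mean word length (characters)
        len(set(words)) / n_words,              # type-token ratio
        text.count(",") / n_words,              # commas per word
    ]


def fit_min_max(rows):
    """Learn per-feature minima and maxima from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(row, lows, highs):
    """Min-max scale one row; constant features map to 0.0."""
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(row, lows, highs)
    ]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

To classify a new document under these assumptions, compute `stylometric_features` for every labeled training text, call `fit_min_max` on those rows, apply `scale` to both the training rows and the new document's row, and pass the results to `knn_predict`. Learning the scaling bounds from the training data only keeps information about held-out documents from leaking into the classifier during cross-validation.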