Keywords: GAN; Imbalanced class data; Minor sample; Neural network; Oversampling; SMOTE

MeSH: Algorithms; Machine Learning; Support Vector Machine; Probability

Source: DOI:10.1016/j.neunet.2024.106157

Abstract:
The class imbalance problem (CIP) in a dataset is a major challenge that significantly affects the performance of Machine Learning (ML) models, resulting in biased predictions. Numerous techniques have been proposed to address CIP, including, but not limited to, oversampling, undersampling, and cost-sensitive approaches. Because they can generate synthetic data, oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) are the methodology most widely used by researchers. However, one of SMOTE's potential disadvantages is that newly created minor samples overlap with major samples, which increases the probability that ML models will be biased toward the major classes. Generative adversarial networks (GANs) have recently garnered much attention for their ability to create realistic samples. However, GANs are hard to train despite their potential. Considering these opportunities, this work proposes two novel techniques, GAN-based Oversampling (GBO) and Support Vector Machine-SMOTE-GAN (SSG), to overcome the limitations of the existing approaches. The preliminary results show that SSG and GBO performed better on the nine imbalanced benchmark datasets than several existing SMOTE-based approaches. Additionally, the proposed SSG and GBO methods can classify the minor class with more than 90% accuracy when tested with 20%, 30%, and 40% of the test data. The study also revealed that the minor samples generated by SSG exhibit Gaussian distributions, which is often difficult to achieve using original SMOTE and SVM-SMOTE.
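
To make the overlap concern concrete, the following is a minimal sketch of the interpolation step behind classic SMOTE, written in NumPy. It is not the GBO or SSG method proposed in the paper, and the function name `smote_sketch` and its parameters (`n_new`, `k`, `seed`) are hypothetical names chosen for illustration. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbors; because nothing in this step looks at the majority class, such a segment can pass through a region occupied by majority samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Illustrative sketch only: assumes X_min holds at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances among minority samples only.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its k nearest neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random fraction along the segment base -> neighbor.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=6, k=2)
print(X_new.shape)  # (6, 2)
```

Every synthetic point here stays on a segment joining two existing minority samples, which is why the output can never leave the region spanned by those samples but can land among majority samples that happen to sit between them.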