Class balancing

  • Article type: Journal Article
    Internet of Things (IoT) devices are driving advances in innovation, efficiency, and sustainability across various industries. However, as the number of connected IoT devices increases, the risk of intrusion becomes a major concern in IoT security. To prevent intrusions, it is crucial to implement intrusion detection systems (IDSs) that can detect and prevent such attacks. IDSs are a critical component of cybersecurity infrastructure. They are designed to detect and respond to malicious activities within a network or system. Traditional IDS methods rely on predefined signatures or rules to identify known threats, but these techniques may struggle to detect novel or sophisticated attacks. IDSs built on machine learning (ML) and deep learning (DL) techniques have been proposed to improve attack detection and thereby enhance overall cybersecurity posture and resilience. However, ML and DL techniques face several issues that may impact the models' performance and effectiveness, such as overfitting and the effect of unimportant features on finding meaningful patterns. To ensure better performance and reliability of machine learning models in IDSs when dealing with new and unseen threats, the models need to be optimized. This can be done by addressing overfitting and implementing feature selection. In this paper, we propose a scheme to optimize IoT intrusion detection by using class balancing and feature selection for preprocessing. We evaluated the scheme on the UNSW-NB15 and NSL-KDD datasets by implementing two different ensemble models: one using a support vector machine (SVM) with bagging and another using long short-term memory (LSTM) with stacking. The performance results and confusion matrices show that LSTM stacking with analysis of variance (ANOVA) feature selection is the superior model for classifying network attacks. It achieves accuracies of 96.92% and 99.77% and overfitting values of 0.33% and 0.04% on the two datasets, respectively. The model's ROC curve also has a sharp bend, with AUC values of 0.9665 and 0.9971 for the UNSW-NB15 and NSL-KDD datasets, respectively.
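
The ANOVA feature selection named in this abstract can be illustrated with a minimal sketch. It assumes the usual one-way ANOVA F-statistic (between-class variance over within-class variance) computed per feature, keeping the k highest-scoring features. The function names `anova_f_score` and `select_top_k`, the toy data, and the choice of k are hypothetical; the abstract does not say how the authors configured this step.

```python
from collections import defaultdict
from statistics import fmean


def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of a single feature against class labels."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n_total, n_groups = len(values), len(groups)
    if n_groups < 2 or n_total <= n_groups:
        return 0.0  # F is undefined here; treat the feature as uninformative
    grand_mean = fmean(values)
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (n_groups - 1)) / (ss_within / (n_total - n_groups))


def select_top_k(rows, labels, k):
    """Return the (sorted) indices of the k best features and all F-scores."""
    n_features = len(rows[0])
    scores = [anova_f_score([row[j] for row in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [5.0, 4.0], [5.1, 5.0], [4.9, 3.0]]
labels = [0, 0, 0, 1, 1, 1]
selected, scores = select_top_k(rows, labels, k=1)
print(selected, scores)
```

A feature whose class means coincide gets an F-score of zero, so it is dropped first; this is the sense in which ANOVA filtering removes features that do not help separate the classes.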

  • Article type: Journal Article
    In contemporary society, depression has emerged as a prominent mental disorder that exhibits exponential growth and exerts a substantial influence on premature mortality. Although numerous studies have applied machine learning methods to forecast signs of depression, only a limited number have treated the severity level as a multiclass variable. Moreover, an equal distribution of data across all classes rarely occurs in real communities, so the inevitable imbalance across multiple classes is considered a substantial challenge in this domain. This research therefore emphasizes the significance of addressing class imbalance in a multiclass context. We introduce a new approach, feature group partitioning (FGP), in the data preprocessing phase, which effectively reduces the dimensionality of the features to a minimum. This study used synthetic oversampling techniques, specifically the Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic (ADASYN) sampling, for class balancing. The dataset used in this research was collected from university students by administering the Burns Depression Checklist (BDC). For methodological modifications, we implemented heterogeneous ensemble learning (stacking), homogeneous ensemble learning (bagging), and five distinct supervised machine learning algorithms. Overfitting was mitigated by evaluating accuracy on the training, validation, and testing datasets. To justify the effectiveness of the prediction models, balanced accuracy, sensitivity, specificity, precision, and F1-score are used. Overall, comprehensive analysis demonstrates the difference between the Conventional Depression Screening (CDS) and FGP approaches. In summary, the results show that the stacking classifier for FGP with SMOTE yields the highest balanced accuracy, at 92.81%. The empirical evidence demonstrates that the FGP approach, when combined with SMOTE, is able to produce better performance in predicting the severity of depression. Most importantly, the reduction in training time achieved by the FGP approach for all of the classifiers is a significant achievement of this research.
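
The two ideas this abstract leans on, SMOTE-style oversampling and balanced accuracy, can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline and not the imbalanced-learn implementation: it assumes numeric features, Euclidean distance, and oversampling every class up to the size of the largest one. The names `smote_like_oversample` and `balanced_accuracy`, the parameter `k_neighbors`, and the toy severity labels are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def smote_like_oversample(rows, labels, k_neighbors=2, seed=0):
    """Add synthetic minority samples by interpolating between same-class neighbors."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    new_rows, new_labels = list(rows), list(labels)
    for label, members in by_class.items():
        if len(members) < 2:
            continue  # a single sample has no neighbor to interpolate with
        for _ in range(target - len(members)):
            base = rng.choice(members)
            # Nearest same-class neighbors of the chosen sample.
            neighbors = sorted(
                (m for m in members if m is not base),
                key=lambda m: math.dist(base, m),
            )[:k_neighbors]
            other = rng.choice(neighbors)
            gap = rng.random()  # position on the segment from base to neighbor
            new_rows.append([b + gap * (o - b) for b, o in zip(base, other)])
            new_labels.append(label)
    return new_rows, new_labels


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits, totals = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy three-class data (hypothetical severity levels).
rows = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
        [5.0, 5.0], [5.5, 5.2],
        [9.0, 1.0], [9.4, 1.2], [9.2, 0.8]]
labels = ["none"] * 4 + ["mild"] * 2 + ["severe"] * 3
new_rows, new_labels = smote_like_oversample(rows, labels)
print(Counter(new_labels))
```

Balanced accuracy is the natural companion metric: a classifier that always predicts the majority class can score high plain accuracy on imbalanced data, but its balanced accuracy falls to 1 divided by the number of classes.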

  • Article type: Journal Article
    From smart homes to industrial environments, the IoT helps ease daily activities, some of which are critical. More and more devices are connected to and through the Internet, which, given the large number of different manufacturers, may lead to a lack of security standards. Denial of service attacks (DDoS, DoS) are the most common and critical attacks against and from these networks; in the third quarter of 2021, the total number of advanced targeted DDoS attacks increased by 31% compared to the same period of 2020. This work uses the Bot-IoT dataset, addressing its class imbalance problem, to build a novel intrusion detection system based on machine learning and deep learning models. To evaluate how record timestamps affect the predictions, we used three different feature sets for binary and multiclass classification; this helped us avoid feature dependencies, as produced by the Argus flow data generator, whilst achieving an average accuracy >99%. We then conducted comprehensive experimentation, including time performance evaluation, matching and exceeding the results of the current state of the art for identifying denial of service attacks; the Decision Tree and Multi-layer Perceptron models were the best-performing methods for identifying DDoS and DoS attacks over IoT networks.

  • Article type: Journal Article
    We present DeepVesselNet, an architecture tailored to the challenges faced when extracting vessel trees and networks and corresponding features in 3-D angiographic volumes using deep learning. We discuss the problems of low execution speed and high memory requirements associated with full 3-D networks, high class imbalance arising from the low percentage (<3%) of vessel voxels, and the unavailability of accurately annotated 3-D training data, and offer solutions as the building blocks of DeepVesselNet. First, we formulate 2-D orthogonal cross-hair filters, which make use of 3-D context information at a reduced computational burden. Second, we introduce a class balancing cross-entropy loss function with false-positive rate correction to handle the high class imbalance and high false-positive rate problems associated with existing loss functions. Finally, we generate a synthetic dataset using a computational angiogenesis model capable of simulating vascular tree growth under physiological constraints on local network structure and topology, and use these data for transfer learning. We demonstrate the performance on a range of angiographic volumes at different spatial scales, including clinical MRA data of the human brain as well as CTA microscopy scans of the rat brain. Our results show that cross-hair filters achieve over 23% improvement in speed, a lower memory footprint, lower network complexity (which prevents overfitting), and accuracy that does not differ from that of full 3-D filters. Our class balancing metric is crucial for training the network, and transfer learning with synthetic data is an efficient, robust, and very generalizable approach, leading to a network that excels in a variety of angiography segmentation tasks. We observe that sub-sampling and max pooling layers may lead to a drop in performance in tasks that involve voxel-sized structures. To this end, the DeepVesselNet architecture does not use any form of sub-sampling layer and works well for vessel segmentation, centerline prediction, and bifurcation detection. We make our synthetic training data publicly available, fostering future research and serving as one of the first public datasets for brain vessel tree segmentation and analysis.
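
To see why a class balancing cross-entropy matters when fewer than 3% of voxels are vessels, consider the following minimal sketch. It assumes one common form of balancing, in which the foreground and background log-losses are each averaged over their own class before being summed; the abstract does not give the exact formula, and the false-positive rate correction it mentions is not reproduced here. The function names, the `eps` clipping constant, and the toy numbers are hypothetical.

```python
import math


def plain_bce(probs, targets, eps=1e-7):
    """Ordinary binary cross-entropy averaged over all voxels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total -= math.log(p) if t == 1 else math.log(1 - p)
    return total / len(probs)


def class_balanced_bce(probs, targets, eps=1e-7):
    """Cross-entropy in which each class is averaged over its own voxel count."""
    pos = [min(max(p, eps), 1 - eps) for p, t in zip(probs, targets) if t == 1]
    neg = [min(max(p, eps), 1 - eps) for p, t in zip(probs, targets) if t == 0]
    loss = 0.0
    if pos:
        loss -= sum(math.log(p) for p in pos) / len(pos)
    if neg:
        loss -= sum(math.log(1 - p) for p in neg) / len(neg)
    return loss


# Toy volume (hypothetical): 3 vessel voxels among 100, and a model that
# predicts "background" almost everywhere.
targets = [1] * 3 + [0] * 97
probs = [0.01] * 100
print(plain_bce(probs, targets), class_balanced_bce(probs, targets))
```

With the plain loss, the three missed vessel voxels are diluted by the 97 easy background voxels, so predicting background everywhere looks nearly optimal. Under the balanced form, the missed foreground contributes as much as the whole background class, which penalizes that degenerate solution.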