Oversampling

  • Article type: Journal Article
    In contemporary society, depression has emerged as a prominent mental disorder that exhibits exponential growth and exerts a substantial influence on premature mortality. Although numerous studies have applied machine learning methods to forecast signs of depression, only a limited number have taken the severity level into account as a multiclass variable. Moreover, an equal distribution of data across all classes rarely occurs in practical communities, so the inevitable class imbalance across multiple classes is considered a substantial challenge in this domain. Accordingly, this research emphasizes the significance of addressing class imbalance in a multiclass context. We introduce a new approach, Feature Group Partitioning (FGP), in the data preprocessing phase, which effectively reduces the feature dimensionality to a minimum. This study utilized synthetic oversampling techniques, specifically the Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic (ADASYN) sampling, for class balancing. The dataset used in this research was collected from university students by administering the Burns Depression Checklist (BDC). For methodological modifications, we implemented heterogeneous ensemble learning (stacking), homogeneous ensemble learning (bagging), and five distinct supervised machine learning algorithms. The issue of overfitting was mitigated by evaluating the accuracy of the training, validation, and testing datasets. To justify the effectiveness of the prediction models, the balanced accuracy, sensitivity, specificity, precision, and F1-score indices were used. Overall, the comprehensive analysis demonstrates the distinction between the Conventional Depression Screening (CDS) and FGP approaches. In summary, the results show that the stacking classifier with the FGP and SMOTE approach yields the highest balanced accuracy, at 92.81%. The empirical evidence demonstrates that the FGP approach, when combined with SMOTE, is able to produce better performance in predicting the severity of depression. Most importantly, the reduction in training time achieved by the FGP approach across all classifiers is a significant achievement of this research.
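    A minimal sketch of the class-balancing and stacking portion of such a pipeline, assuming scikit-learn and imbalanced-learn; the BDC dataset and the FGP partitioning step are not public, so a synthetic multiclass dataset stands in, and all model choices are illustrative rather than the authors' configuration.

    ```python
    # Sketch: SMOTE class balancing followed by a stacking classifier.
    # Placeholder data; the FGP feature-partitioning step is not shown.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Synthetic stand-in for a multiclass severity label (4 imbalanced levels).
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                               n_classes=4, weights=[0.55, 0.25, 0.15, 0.05],
                               n_clusters_per_class=1, random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

    # Oversample the training split only, so the test distribution stays untouched.
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=42)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000))
    stack.fit(X_bal, y_bal)
    print("balanced accuracy:", balanced_accuracy_score(y_te, stack.predict(X_te)))
    ```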

  • Article type: Journal Article
    Machine learning models are revolutionizing our approaches to discovering and designing bioactive peptides. These models often lack protein structure awareness, as they rely heavily on sequential data. The models excel at identifying sequences of a particular biological nature or activity, but they frequently fail to comprehend their intricate mechanism(s) of action. To solve two problems at once, we studied the mechanisms of action and structural landscape of antimicrobial peptides as (i) membrane-disrupting peptides, (ii) membrane-penetrating peptides, and (iii) protein-binding peptides. By analyzing critical features such as dipeptides and physicochemical descriptors, we developed models with high accuracy (86-88%) in predicting these categories. However, our initial models (1.0 and 2.0) exhibited a bias towards α-helical and coiled structures, influencing predictions. To address this structural bias, we implemented subset selection and data reduction strategies. The former gave three structure-specific models for peptides likely to fold into α-helices (models 1.1 and 2.1), coils (1.3 and 2.3), or mixed structures (1.4 and 2.4). The latter depleted over-represented structures, leading to structure-agnostic predictors 1.5 and 2.5. Additionally, our research highlights the sensitivity of important features to different structure classes across models.
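    For illustration, a small sketch of dipeptide-composition features, one of the feature types mentioned above; the example sequence and the plain-Python implementation are assumptions, not the authors' feature pipeline.

    ```python
    # Sketch: 400-dimensional dipeptide-composition features for a peptide sequence.
    from itertools import product

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs

    def dipeptide_composition(seq):
        """Normalized count of each of the 400 possible dipeptides in `seq`."""
        counts = {dp: 0 for dp in DIPEPTIDES}
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if pair in counts:
                counts[pair] += 1
        total = max(len(seq) - 1, 1)
        return [counts[dp] / total for dp in DIPEPTIDES]

    # Illustrative peptide sequence (hypothetical input).
    features = dipeptide_composition("GIGKFLHSAKKFGKAFVGEIMNS")
    print(len(features), round(sum(features), 3))  # 400 features summing to ~1.0
    ```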

  • Article type: Journal Article
    BACKGROUND: Bronchopulmonary dysplasia-associated pulmonary hypertension (BPD-PH) remains a devastating clinical complication seriously affecting the therapeutic outcome of preterm infants. Hence, early prevention and timely diagnosis prior to pathological change is the key to reducing morbidity and improving prognosis. Our primary objective is to utilize machine learning techniques to build predictive models that could accurately identify BPD infants at risk of developing PH.
    METHODS: The data utilized in this study were collected from the neonatology departments of four tertiary-level hospitals in China. To address the issue of imbalanced data, the synthetic minority over-sampling technique (SMOTE), an oversampling algorithm, was applied to improve the model.
    RESULTS: Seven hundred sixty-one clinical records were collected in our study. Following data pre-processing and feature selection, 5 of the 46 features were used to build the models, including duration of invasive respiratory support (days), severity of BPD, ventilator-associated pneumonia, pulmonary hemorrhage, and early-onset PH. Four machine learning models were applied to predictive learning, and after comprehensive selection, one model was ultimately chosen. The model achieved 93.8% sensitivity, 85.0% accuracy, and an AUC of 0.933. A logistic regression score greater than 0 was identified as a warning sign of BPD-PH.
    CONCLUSIONS: We comprehensively compared different machine learning models and ultimately obtained a good prognostic model, sufficient to support pediatric clinicians in making an early diagnosis and formulating a better treatment plan for pediatric patients with BPD-PH.
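    A hedged sketch of the reported workflow (SMOTE, then a logistic regression whose linear score above 0 flags risk); the clinical data is not available, so random placeholder values are used, and the column names merely echo the five features listed in the abstract.

    ```python
    # Sketch: SMOTE, then logistic regression; a linear score > 0 is the warning sign.
    import numpy as np
    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 761  # number of clinical records reported in the abstract
    X = pd.DataFrame({
        "invasive_support_days": rng.integers(0, 60, n),
        "bpd_severity": rng.integers(1, 4, n),           # placeholder ordinal coding
        "ventilator_assoc_pneumonia": rng.integers(0, 2, n),
        "pulmonary_hemorrhage": rng.integers(0, 2, n),
        "early_onset_ph": rng.integers(0, 2, n),
    })
    y = rng.binomial(1, 0.15, n)                         # placeholder imbalanced outcome

    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    scores = clf.decision_function(X)                    # logistic-regression linear score
    print("records flagged as at-risk:", int((scores > 0).sum()), "of", n)
    ```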

  • Article type: Journal Article
    Coronavirus Disease (COVID-19) was declared a pandemic by the World Health Organization (WHO), and it has not yet ended. As the infection rate of COVID-19 increases, computational approaches are needed to predict patients infected with COVID-19, in order to speed up diagnosis and minimize human error compared with conventional diagnosis. However, the number of negative samples being far higher than the number of positive samples creates a data imbalance that degrades classification performance and biases the model evaluation results. This study proposes a new oversampling technique, TRIM-SBR, to generate minority-class data for diagnosing patients infected with COVID-19. Developing an oversampling technique remains challenging because of the data generalization issue. The proposed method is based on pruning, looking for specific minority areas while retaining data generalization, and produces minority data seeds that serve as benchmarks for creating new synthetic data using bootstrap resampling. Accuracy, specificity, sensitivity, F-measure, and AUC are used to evaluate classifier performance in data imbalance cases. The results show that the TRIM-SBR method provides the best performance compared to other oversampling techniques.
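    TRIM-SBR itself is the paper's novel method and is not reproduced here; the sketch below only illustrates the generic idea of bootstrap-resampling minority seed samples, with the seed-selection (pruning) step omitted and all data synthetic.

    ```python
    # Sketch: bootstrap-resampling minority "seed" samples with small jitter
    # until the classes are balanced (seed selection/pruning omitted).
    import numpy as np

    def bootstrap_oversample(X_min, n_new, jitter=0.01, seed=0):
        """Draw n_new rows from X_min with replacement, adding small Gaussian noise."""
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(X_min), size=n_new)
        noise = rng.normal(0.0, jitter, size=(n_new, X_min.shape[1]))
        return X_min[idx] + noise

    X_majority = np.random.default_rng(1).normal(0, 1, (900, 5))
    X_minority = np.random.default_rng(2).normal(2, 1, (100, 5))
    X_new = bootstrap_oversample(X_minority, n_new=len(X_majority) - len(X_minority))
    print(X_new.shape)  # (800, 5) synthetic minority rows
    ```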

  • Article type: Journal Article
    To improve the accuracy of transformer fault diagnosis and to mitigate the low identification accuracy caused by imbalanced samples and insufficient model training, this paper proposes a transformer fault diagnosis method based on SMOTE and NGO-GBDT. First, the Synthetic Minority Over-sampling Technique (SMOTE) was used to expand the minority-class samples. Second, the non-coding ratio method was used to construct multi-dimensional feature parameters, and a Light Gradient Boosting Machine (LightGBM) feature optimization strategy was introduced to screen the optimal feature subset. Finally, the Northern Goshawk Optimization (NGO) algorithm was used to optimize the parameters of the Gradient Boosting Decision Tree (GBDT), and transformer fault diagnosis was then realized. The results show that the proposed method can reduce the misjudgment of minority-class samples. Compared with other ensemble models, the proposed method achieves high fault identification accuracy, a low misjudgment rate, and stable performance.
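    A rough sketch of the pipeline's structure, assuming imbalanced-learn and LightGBM are available; the NGO tuner is not in common libraries, so a randomized search stands in for it, and the dataset and parameter grid are placeholders.

    ```python
    # Sketch: SMOTE -> LightGBM importance-based feature screening -> boosted trees
    # tuned by randomized search (standing in for the NGO optimizer).
    import numpy as np
    from imblearn.over_sampling import SMOTE
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # Placeholder multiclass "fault" data with imbalanced classes.
    X, y = make_classification(n_samples=600, n_features=15, n_informative=8,
                               n_classes=4, weights=[0.5, 0.3, 0.15, 0.05],
                               n_clusters_per_class=1, random_state=0)
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

    # Screen features by LightGBM importance (keep the top 8 as an example).
    lgbm = LGBMClassifier(random_state=0).fit(X_bal, y_bal)
    top = np.argsort(lgbm.feature_importances_)[::-1][:8]
    X_sel = X_bal[:, top]

    search = RandomizedSearchCV(
        GradientBoostingClassifier(random_state=0),
        param_distributions={"n_estimators": [100, 200, 300],
                             "learning_rate": [0.05, 0.1, 0.2],
                             "max_depth": [2, 3, 4]},
        n_iter=10, cv=3, random_state=0)
    search.fit(X_sel, y_bal)
    print("best parameters:", search.best_params_)
    ```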

  • Article type: Journal Article
    BACKGROUND: Genome-wide association studies have successfully identified genetic variants associated with human disease. Various statistical approaches based on penalized and machine learning methods have recently been proposed for disease prediction. In this study, we evaluated the performance of several such methods for predicting asthma using the Korean Chip (KORV1.1) from the Korean Genome and Epidemiology Study (KoGES).
    RESULTS: First, single-nucleotide polymorphisms were selected via single-variant tests using logistic regression, adjusted for several epidemiological factors. Next, we evaluated the following methods for disease prediction: ridge, least absolute shrinkage and selection operator, elastic net, smoothly clipped absolute deviation, support vector machine, random forest, boosting, bagging, naïve Bayes, and k-nearest neighbor. Finally, we compared their predictive performance based on the area under the receiver operating characteristic curve, precision, recall, F1-score, Cohen's Kappa, balanced accuracy, error rate, Matthews correlation coefficient, and area under the precision-recall curve. Additionally, three oversampling algorithms were used to deal with the class imbalance problem.
    CONCLUSIONS: Our results show that the penalized methods exhibit better predictive performance for asthma than the machine learning methods. On the other hand, in the oversampling study, the random forest and boosting methods overall showed better predictive performance than the penalized methods.
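    A minimal sketch comparing the penalized logistic regressions (ridge, lasso, elastic net) of the kind evaluated here, assuming scikit-learn; KoGES genotypes are not public, so random 0/1/2 dosages stand in for the selected SNPs.

    ```python
    # Sketch: penalized logistic regressions (ridge, lasso, elastic net) on
    # placeholder SNP dosage data, compared by ROC AUC.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(2000, 200)).astype(float)   # genotype dosages 0/1/2
    logit = X[:, :5].sum(axis=1) * 0.4 - 4.0                  # a few "causal" SNPs
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    models = {
        "ridge": LogisticRegression(penalty="l2", max_iter=5000),
        "lasso": LogisticRegression(penalty="l1", solver="saga", max_iter=5000),
        "elastic net": LogisticRegression(penalty="elasticnet", solver="saga",
                                          l1_ratio=0.5, max_iter=5000),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(f"{name}: AUC = {auc:.3f}")
    ```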

  • Article type: Journal Article
    Data imbalance is a challenging problem in classification tasks, and when combined with class overlapping, it further deteriorates classification performance. However, existing studies have rarely addressed both issues simultaneously. In this article, we propose a novel quantum-based oversampling method (QOSM) to effectively tackle data imbalance and class overlapping, thereby improving classification performance. QOSM utilizes the quantum potential theory to calculate the potential energy of each sample and selects the sample with the lowest potential as the center of each cover generated by a constructive covering algorithm. This approach optimizes cover center selection and better captures the distribution of the original samples, particularly in the overlapping regions. In addition, oversampling is performed on the samples of the minority class covers to mitigate the imbalance ratio (IR). We evaluated QOSM using three traditional classifiers (support vector machines [SVM], k-nearest neighbor [KNN], and naive Bayes [NB] classifier) on 10 publicly available KEEL data sets characterized by high IRs and varying degrees of overlap. Experimental results demonstrate that QOSM significantly improves classification accuracy compared to approaches that do not address class imbalance and overlapping. Moreover, QOSM consistently outperforms existing oversampling methods tested. With its compatibility with different classifiers, QOSM exhibits promising potential to improve the classification performance of highly imbalanced and overlapped data.
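    The paper's quantum-potential formulation is not reproduced here; the sketch below only illustrates the cover-center idea with a simplified Gaussian-kernel potential (lower potential corresponds to a denser neighborhood), which is an assumption rather than QOSM's actual computation.

    ```python
    # Sketch: pick the lowest-"potential" minority sample as a cover center,
    # using a simple Gaussian-kernel potential as a stand-in.
    import numpy as np

    def potentials(X, sigma=1.0):
        """Negative Gaussian-kernel density, so minima lie in dense regions."""
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
        return -np.exp(-d2 / (2 * sigma ** 2)).sum(axis=1)

    X_minority = np.random.default_rng(0).normal(0, 1, (50, 2))
    center_idx = int(np.argmin(potentials(X_minority)))
    print("cover center:", X_minority[center_idx])
    ```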

  • Article type: Journal Article
    In this study, we propose a data augmentation method for machine learning based on relabeling data in caregiving and nursing staff indoor localization with Bluetooth Low Energy (BLE) technology. Indoor localization is used to monitor staff-to-patient assistance in caregiving and to gain insights into workload management. However, improving accuracy is challenging when only a limited amount of data is available for training. In this paper, we propose a data augmentation method that reuses the Received Signal Strength (RSS) from different beacons by relabeling it to locations with fewer samples, resolving the data imbalance. The standard deviation and the Kullback-Leibler divergence between the minority and majority classes are used to measure signal patterns and find matching beacons to relabel. By matching beacons between classes, two variations of relabeling are implemented, namely full and partial matching. Performance is evaluated using a real-world dataset we collected over five days in a nursing care facility equipped with 25 BLE beacons. A Random Forest model is used for location recognition, and performance is compared using the weighted F1-score to account for class imbalance. By increasing the beacon data with the proposed relabeling method for data augmentation, we achieve a higher minority-class F1-score than augmentation with random sampling, the Synthetic Minority Oversampling Technique (SMOTE), and Adaptive Synthetic Sampling (ADASYN). Our proposed method utilizes the collected beacon data by leveraging majority-class samples. Full matching demonstrated a 6 to 8% improvement over the original baseline in the overall weighted F1-score.
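    A small sketch of the beacon-matching idea, assuming SciPy for the Kullback-Leibler divergence; the RSS logs, histogram binning, and scoring rule are placeholders, not the authors' implementation.

    ```python
    # Sketch: score candidate beacons against an under-sampled location using
    # KL divergence and standard-deviation difference over RSS histograms.
    import numpy as np
    from scipy.stats import entropy

    def kl_rss(rss_a, rss_b, bins=np.arange(-100, -30, 5)):
        """KL divergence between histogram estimates of two RSS distributions."""
        p, _ = np.histogram(rss_a, bins=bins)
        q, _ = np.histogram(rss_b, bins=bins)
        p = (p + 1e-9) / (p + 1e-9).sum()
        q = (q + 1e-9) / (q + 1e-9).sum()
        return entropy(p, q)

    rng = np.random.default_rng(0)
    minority_rss = rng.normal(-70, 4, 200)              # RSS at the under-sampled location
    candidate_beacons = {f"beacon_{i}": rng.normal(-70 - 2 * i, 4, 800) for i in range(5)}

    scores = {name: (kl_rss(minority_rss, rss), abs(minority_rss.std() - rss.std()))
              for name, rss in candidate_beacons.items()}
    best = min(scores, key=lambda k: scores[k])         # smallest (KL, std gap) first
    print("best-matching beacon to relabel from:", best)
    ```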

  • Article type: Journal Article
    With the rapid development of the Internet of Things (IoT), attackers increasingly use botnets to control IoT devices and carry out distributed denial-of-service (DDoS) and other cyber attacks on the internet. In real attacks, the small proportion of attack packets in IoT traffic leads to low intrusion detection accuracy. To address this problem, this paper proposes an oversampling algorithm, KG-SMOTE, based on the Gaussian distribution and K-means clustering. It inserts synthetic samples drawn from a Gaussian probability distribution, extends the cluster nodes of the minority-class samples in the same proportion, increases the density of the minority-class samples, and enlarges the amount of minority-class data, providing data support for IoT-based DDoS attack detection. Experiments show that the balanced dataset generated by this method effectively improves intrusion detection accuracy in each category and effectively addresses the data imbalance problem.
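    A rough sketch of the general idea (K-means on the minority class, then Gaussian samples around each cluster center in proportion to cluster size), assuming scikit-learn; this is not the published KG-SMOTE code, and the noise scale is an assumption.

    ```python
    # Sketch: K-means on the minority class, then Gaussian samples around each
    # cluster center, allocated in proportion to cluster size.
    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_gaussian_oversample(X_min, n_new, k=3, seed=0):
        rng = np.random.default_rng(seed)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_min)
        synthetic = []
        for c in range(k):
            members = X_min[km.labels_ == c]
            share = int(round(n_new * len(members) / len(X_min)))  # proportional share
            scale = members.std(axis=0) + 1e-6                     # per-feature spread
            synthetic.append(rng.normal(km.cluster_centers_[c], scale,
                                        size=(share, X_min.shape[1])))
        return np.vstack(synthetic)

    X_minority = np.random.default_rng(1).normal(0, 1, (60, 4))
    print(kmeans_gaussian_oversample(X_minority, n_new=300).shape)
    ```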

  • Article type: Journal Article
    Cheating detection in large-scale assessment has received considerable attention in the extant literature. However, none of the previous studies in this line of research investigated the stacking ensemble machine learning algorithm for cheating detection, and no study addressed the issue of class imbalance using resampling. This study explored the application of the stacking ensemble machine learning algorithm to analyze the item responses, response times, and augmented data of test-takers to detect cheating behaviors. The performance of the stacking method was compared with that of two other ensemble methods (bagging and boosting) as well as six base non-ensemble machine learning algorithms. Issues related to class imbalance and input features were addressed. The study results indicated that stacking, resampling, and feature sets including augmented summary data generally performed better than their counterparts in cheating detection. Compared with the other competing machine learning algorithms investigated in this study, the meta-model from stacking using discriminant analysis based on the top two base models (Gradient Boosting and Random Forest) generally performed best across all the study conditions when item responses and the augmented summary statistics were used as the input features with an under-sampling ratio of 10:1.
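    A hedged sketch of the best-performing configuration described above, assuming scikit-learn and imbalanced-learn: under-sampling to a roughly 10:1 majority-to-minority ratio, then stacking Gradient Boosting and Random Forest under a discriminant-analysis meta-model; the data is a synthetic placeholder.

    ```python
    # Sketch: under-sample to ~10:1, then stack Gradient Boosting and Random Forest
    # under a linear discriminant analysis meta-model.
    import numpy as np
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  StackingClassifier)

    # Placeholder "cheating" labels: a rare positive class.
    X, y = make_classification(n_samples=5000, n_features=30, n_informative=10,
                               weights=[0.98, 0.02], random_state=0)

    # sampling_strategy=0.1 keeps the minority:majority ratio at 1:10 after resampling.
    X_res, y_res = RandomUnderSampler(sampling_strategy=0.1, random_state=0).fit_resample(X, y)
    print("resampled class counts:", dict(zip(*np.unique(y_res, return_counts=True))))

    stack = StackingClassifier(
        estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                    ("rf", RandomForestClassifier(random_state=0))],
        final_estimator=LinearDiscriminantAnalysis())
    stack.fit(X_res, y_res)
    ```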
