Oversampling

  • Article type: Journal Article
    BACKGROUND: Bronchopulmonary dysplasia-associated pulmonary hypertension (BPD-PH) remains a devastating clinical complication seriously affecting the therapeutic outcome of preterm infants. Hence, early prevention and timely diagnosis prior to pathological change is the key to reducing morbidity and improving prognosis. Our primary objective is to utilize machine learning techniques to build predictive models that could accurately identify BPD infants at risk of developing PH.
    METHODS: The data utilized in this study were collected from neonatology departments of four tertiary-level hospitals in China. To address the issue of imbalanced data, the synthetic minority over-sampling technique (SMOTE), an oversampling algorithm, was applied to improve the model.
    RESULTS: Seven hundred sixty one clinical records were collected in our study. Following data pre-processing and feature selection, 5 of the 46 features were used to build models, including duration of invasive respiratory support (day), the severity of BPD, ventilator-associated pneumonia, pulmonary hemorrhage, and early-onset PH. Four machine learning models were applied to predictive learning, and after comprehensive selection a model was ultimately selected. The model achieved 93.8% sensitivity, 85.0% accuracy, and 0.933 AUC. A score of the logistic regression formula greater than 0 was identified as a warning sign of BPD-PH.
    CONCLUSIONS: We comprehensively compared different machine learning models and ultimately obtained a good prognosis model which was sufficient to support pediatric clinicians to make early diagnosis and formulate a better treatment plan for pediatric patients with BPD-PH.
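
The SMOTE step mentioned in the methods can be sketched as follows. This is a minimal illustration of the general idea (interpolating between a minority sample and one of its nearest minority neighbours), not the study's implementation; the function name `smote`, the parameter `k`, and the toy data are assumptions made for this example.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        # Pick one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Place the new point somewhere on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical two-feature minority class.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.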

  • Article type: Journal Article
    To improve the accuracy of transformer fault diagnosis and to mitigate the low identification accuracy that unbalanced samples cause through insufficient model training, this paper proposes a transformer fault diagnosis method based on SMOTE and NGO-GBDT. First, the Synthetic Minority Over-sampling Technique (SMOTE) was used to expand the minority samples. Second, the non-coding ratio method was used to construct multi-dimensional feature parameters, and a Light Gradient Boosting Machine (LightGBM) feature optimization strategy was introduced to screen the optimal feature subset. Finally, the Northern Goshawk Optimization (NGO) algorithm was used to optimize the parameters of a Gradient Boosting Decision Tree (GBDT), which was then used for transformer fault diagnosis. The results show that the proposed method can reduce the misjudgment of minority samples. Compared with other ensemble models, the proposed method has high fault identification accuracy, a low misjudgment rate, and stable performance.

  • Article type: Journal Article
    Data imbalance is a challenging problem in classification tasks, and when combined with class overlapping, it further deteriorates classification performance. However, existing studies have rarely addressed both issues simultaneously. In this article, we propose a novel quantum-based oversampling method (QOSM) to effectively tackle data imbalance and class overlapping, thereby improving classification performance. QOSM utilizes the quantum potential theory to calculate the potential energy of each sample and selects the sample with the lowest potential as the center of each cover generated by a constructive covering algorithm. This approach optimizes cover center selection and better captures the distribution of the original samples, particularly in the overlapping regions. In addition, oversampling is performed on the samples of the minority class covers to mitigate the imbalance ratio (IR). We evaluated QOSM using three traditional classifiers (support vector machines [SVM], k-nearest neighbor [KNN], and naive Bayes [NB] classifier) on 10 publicly available KEEL data sets characterized by high IRs and varying degrees of overlap. Experimental results demonstrate that QOSM significantly improves classification accuracy compared to approaches that do not address class imbalance and overlapping. Moreover, QOSM consistently outperforms existing oversampling methods tested. With its compatibility with different classifiers, QOSM exhibits promising potential to improve the classification performance of highly imbalanced and overlapped data.
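
The imbalance ratio (IR) referred to above is commonly defined as the majority-class count divided by the minority-class count. The sketch below computes it and the number of synthetic minority samples needed to reach a target IR; the helper names and the 90/10 label split are hypothetical and are not taken from the KEEL data sets used in the paper.

```python
import math
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def samples_needed(labels, target_ir=1.0):
    """Minority samples to add so that majority / minority <= target_ir."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return max(0, math.ceil(majority / target_ir) - minority)

# Hypothetical binary labels: 90 majority (0) and 10 minority (1) samples.
labels = [0] * 90 + [1] * 10
ir = imbalance_ratio(labels)
to_balance = samples_needed(labels)
to_halve = samples_needed(labels, target_ir=2.0)
```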

  • Article type: Journal Article
    Interleukins (ILs) are a group of multifunctional cytokines that play important roles in immune regulation and inflammatory responses. Recently, IL-6 has been found to affect the development of COVID-19, and significantly elevated levels of IL-6 have been reported in patients with severe COVID-19. IL-10 and IL-17 are anti-inflammatory and proinflammatory cytokines, respectively, which play multiple protective roles in host defense against pathogens. At present, a number of machine learning methods have been proposed to predict IL-inducing peptides, but their predictive performance needs further improvement, and the inducing peptides of different ILs are predicted separately rather than with a general approach. In our work, we combine the statistical features of peptide sequences with word embedding to design a general ensemble model named EnILs to predict inducing peptides of different ILs, in which the predicted probabilities of random forest, eXtreme Gradient Boosting, and a neural network are averaged. Compared with state-of-the-art machine learning methods, EnILs shows considerable performance in the prediction of IL-6, IL-10, and IL-17 inducing peptides. In addition, in a case study we predict the most promising IL-6 inducing peptides in the Severe Acute Respiratory Syndrome Coronavirus 2 spike protein for further experimental verification.
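
Averaging the predicted probabilities of several models, as the abstract describes, can be sketched like this. The probability values and the 0.5 decision threshold are made up for illustration; the abstract does not give the models' outputs or the threshold EnILs uses.

```python
def average_ensemble(prob_lists):
    """Average per-sample positive-class probabilities across models."""
    n_models = len(prob_lists)
    return [sum(ps) / n_models for ps in zip(*prob_lists)]

# Hypothetical positive-class probabilities for three peptides,
# one list per base model.
rf_probs = [0.90, 0.20, 0.60]   # random forest
xgb_probs = [0.80, 0.40, 0.30]  # gradient boosting
nn_probs = [0.70, 0.30, 0.45]   # neural network

avg = average_ensemble([rf_probs, xgb_probs, nn_probs])
# Assumed decision rule: call a peptide "inducing" when the mean is >= 0.5.
labels = [int(p >= 0.5) for p in avg]
```

In the third sample one model votes above 0.5 and two vote below it, and the averaged probability (0.45) falls below the assumed threshold.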

  • Article type: Journal Article
    With the rapid development of the Internet of Things (IoT), the frequency of attackers using botnets to control IoT devices in order to perform distributed denial-of-service attacks (DDoS) and other cyber attacks on the internet has significantly increased. In the actual attack process, the small percentage of attack packets in IoT leads to low accuracy of intrusion detection. Based on this problem, the paper proposes an oversampling algorithm, KG-SMOTE, based on Gaussian distribution and K-means clustering, which inserts synthetic samples through Gaussian probability distribution, extends the clustering nodes in minority class samples in the same proportion, increases the density of minority class samples, and improves the amount of minority class sample data in order to provide data support for IoT-based DDoS attack detection. Experiments show that the balanced dataset generated by this method effectively improves the intrusion detection accuracy in each category and effectively solves the data imbalance problem.
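
The combination of K-means clustering and Gaussian sampling can be illustrated with a rough sketch. This is not the KG-SMOTE algorithm itself, whose details the abstract does not give; it assumes a plain Lloyd-style K-means and draws synthetic points from a per-feature Gaussian fitted to each cluster, growing every cluster by the same factor. All function names and the toy data are hypothetical.

```python
import math
import random
import statistics

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns the final list of clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Recompute each center; keep the old one if a cluster is empty.
        centers = [
            [statistics.fmean(dim) for dim in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

def gaussian_oversample(points, k, factor, seed=0):
    """Grow every cluster by the same factor using per-feature Gaussian draws."""
    rng = random.Random(seed)
    synthetic = []
    for cluster in kmeans(points, k, seed=seed):
        if len(cluster) < 2:
            continue  # a single point gives no estimate of spread
        dims = list(zip(*cluster))
        means = [statistics.fmean(d) for d in dims]
        stds = [statistics.pstdev(d) for d in dims]
        n_new = round(len(cluster) * (factor - 1))
        for _ in range(n_new):
            synthetic.append([rng.gauss(m, s) for m, s in zip(means, stds)])
    return synthetic

# Hypothetical minority-class samples forming two loose groups.
minority = [[0.0, 0.2], [0.3, 1.0], [1.1, 0.1], [0.9, 1.2],
            [10.0, 10.4], [10.2, 11.1], [11.3, 10.0], [10.8, 11.5]]
synthetic = gaussian_oversample(minority, k=2, factor=2)
```

Sampling per cluster keeps the synthetic points near dense regions of the minority class, which is the motivation the abstract gives for clustering before oversampling.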

  • Article type: Journal Article
    This study aimed to identify radiomic features of primary tumor and develop a model for indicating extrahepatic metastasis of hepatocellular carcinoma (HCC). Contrast-enhanced computed tomographic (CT) images of 177 HCC cases, including 26 metastatic (MET) and 151 non-metastatic (non-MET), were retrospectively collected and analyzed. For each case, 851 radiomic features, which quantify shape, intensity, texture, and heterogeneity within the segmented volume of the largest HCC tumor in arterial phase, were extracted using Pyradiomics. The dataset was randomly split into training and test sets. Synthetic Minority Oversampling Technique (SMOTE) was performed to augment the training set to 145 MET and 145 non-MET cases. The test set consists of six MET and six non-MET cases. The external validation set is comprised of 20 MET and 25 non-MET cases collected from an independent clinical unit. Logistic regression and support vector machine (SVM) models were identified based on the features selected using the stepwise forward method while the deep convolution neural network, visual geometry group 16 (VGG16), was trained using CT images directly. Grey-level size zone matrix (GLSZM) features constitute four of eight selected predictors of metastasis due to their perceptiveness to the tumor heterogeneity. The radiomic logistic regression model yielded an area under receiver operating characteristic curve (AUROC) of 0.944 on the test set and an AUROC of 0.744 on the external validation set. Logistic regression revealed no significant difference with SVM in the performance and outperformed VGG16 significantly. As extrahepatic metastasis workups, such as chest CT and bone scintigraphy, are standard but exhaustive, radiomic model facilitates a cost-effective method for stratifying HCC patients into eligibility groups of these workups.
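
The AUROC figures above can be read as the fraction of (positive, negative) pairs that a model ranks correctly. The sketch below computes AUROC that way on made-up scores. With six MET and six non-MET test cases there are 36 such pairs, so the reported 0.944 would be consistent with about 34 of 36 pairs being ordered correctly, though the abstract does not state this.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = metastatic) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
area = auroc(labels, scores)  # 8 of the 9 pairs are ordered correctly
```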

  • Article type: Journal Article
    Early and fast detection of disease is essential for the fight against the COVID-19 pandemic. Researchers have focused on developing robust and cost-effective detection methods using deep-learning-based chest X-ray image processing. However, such prediction models are often not well suited to address the challenge of highly imbalanced datasets. The current work is an attempt to address the issue by utilizing unsupervised Variational Auto Encoders (VAEs). First, chest X-ray images are converted to a latent space by learning the most important features using VAEs. Second, a wide range of well-established data resampling techniques are used to balance the preexisting imbalanced classes in the latent vector form of the dataset. Finally, the modified dataset in the new feature space is used to train well-known classification models to classify chest X-ray images into three different classes, viz., "COVID-19", "Pneumonia", and "Normal". In order to capture the quality of the resampling methods, a 10-fold cross-validation technique is applied to the dataset. Extensive experimental analyses have been carried out, and the results so obtained indicate significant improvement in COVID-19 detection using the proposed VAE-based method. Furthermore, the ingenuity of the results has been established by performing a Wilcoxon rank test with a 95% level of significance.

  • Article type: Journal Article
    OBJECTIVE: An operator's capability to accurately comprehend verbal commands is critically important for maintaining the performance of human-machine interaction. It can be evaluated by human mental workload measured with electroencephalography (EEG). However, the time duration of different workload conditions within a task session is unequal due to varied psychophysiological processes across individuals. This leads to imbalance in the EEG data used for training workload classifiers.
    METHODS: In this study, we propose an EEG feature oversampling technique, Gaussian-SMOTE based feature ensemble (GSMOTE-FE), for workload recognition with imbalanced classes. First, artificial EEG instances are drawn from a Gaussian distribution in the margin between the minority and majority workload classes. Tomek links are detected as clues to remove redundant feature vectors. Then, we embed a feature selection module based on the GINI importance while an ensemble classifier committee with bootstrap aggregating is used to further enhance classification performance.
    RESULTS: We validate the GSMOTE-FE framework with an experiment in which operators must understand the correct meaning of instructions given in the Chinese language. Participants' EEG signals and reaction time data were both recorded to validate the proposed workload classifier. Workload classification accuracy and Macro-F1 values are 0.6553 and 0.5862, respectively. The corresponding G-mean and AUC reach 0.5757 and 0.5958, respectively.
    CONCLUSIONS: The performance of the GSMOTE-FE is demonstrated to be comparable with the advanced oversampling techniques. The workload classifier has the capability to indicate low and high levels of the task demand of the Chinese language understanding task.
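
The Tomek links mentioned in the methods are pairs of samples from different classes that are each other's nearest neighbour. A minimal sketch of detecting them is given below; how GSMOTE-FE then uses the links to remove feature vectors is not specified in the abstract and is not modelled here. The function name and the toy data are assumptions.

```python
import math

def tomek_links(X, y):
    """Index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# Hypothetical 2-D feature vectors with binary workload labels.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [5.0, 5.0], [5.2, 5.0]]
y = [0, 0, 0, 1, 1, 1]
links = tomek_links(X, y)
```

Here only samples 2 and 3 form a link: they are mutual nearest neighbours and belong to different classes, which marks them as lying on the class boundary.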

  • Article type: Journal Article
    BACKGROUND: As a major health hazard, the incidence of coronary heart disease has been increasing year by year. Although coronary revascularization, mainly percutaneous coronary intervention, has played an important role in the treatment of coronary heart disease, major adverse cardiovascular events (MACE) such as recurrent or persistent angina pectoris after coronary revascularization remain a very difficult problem in clinical practice.
    OBJECTIVE: Given the high probability of MACE after coronary revascularization, the aim of this study was to develop and validate a predictive model for MACE occurrence within 6 months based on machine learning algorithms.
    METHODS: A retrospective study was performed including 1004 patients who had undergone coronary revascularization at The People's Hospital of Liaoning Province and Affiliated Hospital of Liaoning University of Traditional Chinese Medicine from June 2019 to December 2020. According to the characteristics of available data, an oversampling strategy was adopted for initial preprocessing. We then employed six machine learning algorithms, including decision tree, random forest, logistic regression, naïve Bayes, support vector machine, and extreme gradient boosting (XGBoost), to develop prediction models for MACE depending on clinical information and 6-month follow-up information. Among all samples, 70% were randomly selected for training and the remaining 30% were used for model validation. Model performance was assessed based on accuracy, precision, recall, F1-score, confusion matrix, area under the receiver operating characteristic (ROC) curve (AUC), and visualization of the ROC curve.
    RESULTS: Univariate analysis showed that 21 patient characteristic variables were statistically significant (P<.05) between the groups without and with MACE. Coupled with these significant factors, among the six machine learning algorithms, XGBoost stood out with an accuracy of 0.7788, precision of 0.8058, recall of 0.7345, F1-score of 0.7685, and AUC of 0.8599. Further exploration of the models to identify factors affecting the occurrence of MACE revealed that use of anticoagulant drugs and course of the disease consistently ranked in the top two predictive factors in three developed models.
    CONCLUSIONS: The machine learning risk models constructed in this study can achieve acceptable performance of MACE prediction, with XGBoost performing the best, providing a valuable reference for pointed intervention and clinical decision-making in MACE prevention.
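
The evaluation metrics listed in the methods all follow from the confusion matrix, and the F1-score is the harmonic mean of precision and recall. The sketch below shows the formulas and checks that the reported XGBoost precision (0.8058) and recall (0.7345) reproduce the reported F1-score. The confusion-matrix counts in the second part are hypothetical, since the abstract does not report them.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall, f1_from(precision, recall)

# Reported XGBoost precision and recall from the abstract.
f1_reported = f1_from(0.8058, 0.7345)  # rounds to 0.7685

# Hypothetical counts, used only to illustrate the formulas.
accuracy, precision, recall, f1 = metrics(tp=8, fp=2, fn=4, tn=6)
```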

  • Article type: Journal Article
    Imbalanced classification is widespread in the fields of medical diagnosis, biomedicine, smart cities, and the Internet of Things. An imbalanced data distribution biases traditional classification methods towards the majority classes and leads them to ignore the minority class, which makes these methods ineffective for imbalanced classification. In this paper, a novel imbalanced classification method based on deep learning and a fuzzy support vector machine is proposed and named DFSVM. DFSVM first uses a deep neural network to obtain an embedding representation of the data. This deep neural network is trained with a triplet loss to enhance similarities within classes and differences between classes. To alleviate the effects of the imbalanced data distribution, oversampling is performed in the embedding space of the data. In this paper, we use an oversampling method based on feature and center distance, which can obtain more diverse new samples and prevent overfitting. To enhance the impact of the minority class, we use a fuzzy support vector machine (FSVM) based on cost-sensitive learning as the final classifier. FSVM assigns a higher misclassification cost to minority class samples to improve the classification quality. Experiments were performed on multiple biological datasets and real-world datasets. The experimental results show that DFSVM achieves promising classification performance.
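
The triplet loss used to train the embedding network can be illustrated with one common formulation: the anchor should be closer to a same-class sample (positive) than to a different-class sample (negative) by at least a margin. The Euclidean distance, the margin of 1.0, and the toy embeddings below are assumptions; the abstract does not state which variant DFSVM uses.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical 2-D embeddings.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]

# Negative far from the anchor: the margin is satisfied, so the loss is zero.
easy = triplet_loss(anchor, positive, [3.0, 4.0])
# Negative as close as the positive: the loss equals the margin.
hard = triplet_loss(anchor, positive, [1.0, 0.0])
```

Minimizing this loss pulls same-class embeddings together and pushes different-class embeddings apart, which matches the abstract's stated goal of enhancing within-class similarity and between-class difference.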