ADASYN

  • Article type: Journal Article
    Sleep is a vital physiological process for human health, and accurately detecting various sleep states is crucial for diagnosing sleep disorders. This study presents a novel algorithm for identifying sleep stages from EEG signals that is more efficient and accurate than state-of-the-art methods. The key innovation lies in employing a piecewise linear data reduction technique, called the Halfwave method, in the time domain. This method simplifies EEG signals into a piecewise linear form of reduced complexity while preserving sleep-stage characteristics. A feature vector of six statistical features is then built from parameters of the reduced piecewise linear function. We tested the proposed method on the MIT-BIH Polysomnographic Database, which includes more than 80 hours of recordings from different biomedical signals covering six main sleep classes. Among the classifiers evaluated, the K-Nearest Neighbor classifier performed best with the proposed method. According to the experimental findings, the average sensitivity, specificity, and accuracy of the proposed algorithm on eight records of the Polysomnographic Database are estimated at 94.82%, 96.65%, and 95.73%, respectively. Furthermore, the algorithm shows promise in its computational efficiency, making it suitable for real-time applications such as sleep monitoring devices. Its robust performance across the various sleep classes suggests potential for widespread clinical adoption, advancing the understanding, detection, and management of sleep problems.
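
    Illustrative sketch (Python, scikit-learn): the Halfwave method's parameters and the paper's exact six features are not described above, so the piecewise-linear reduction and the six summary statistics below are hypothetical stand-ins that only show the general shape of such a pipeline, ending in a K-Nearest Neighbor classifier.

        # Illustrative only: generic piecewise-linear reduction + KNN; not the Halfwave method.
        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsClassifier

        def piecewise_linear_features(epoch, n_segments=30):
            """Fit a line to each segment of an EEG epoch and summarize the fits."""
            slopes, residuals = [], []
            for seg in np.array_split(epoch, n_segments):
                t = np.arange(len(seg))
                slope, intercept = np.polyfit(t, seg, 1)
                slopes.append(slope)
                residuals.append(np.mean((seg - (slope * t + intercept)) ** 2))
            slopes, residuals = np.asarray(slopes), np.asarray(residuals)
            # Six summary statistics (hypothetical stand-ins for the paper's feature set).
            return np.array([slopes.mean(), slopes.std(), np.abs(slopes).max(),
                             residuals.mean(), residuals.std(), epoch.std()])

        # X_epochs: (n_epochs, n_samples) raw EEG epochs; y: sleep-stage labels (6 classes).
        rng = np.random.default_rng(0)
        X_epochs = rng.standard_normal((200, 3000))
        y = rng.integers(0, 6, size=200)

        X = np.vstack([piecewise_linear_features(e) for e in X_epochs])
        print(cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean())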

  • Article type: Journal Article
    In contemporary society, depression has emerged as a prominent mental disorder that exhibits exponential growth and exerts a substantial influence on premature mortality. Although numerous studies have applied machine learning methods to forecast signs of depression, only a limited number have taken the severity level into account as a multiclass variable. Moreover, an equal data distribution among all classes rarely occurs in real-world populations, so the inevitable class imbalance across multiple classes is considered a substantial challenge in this domain. This research therefore emphasizes the importance of addressing class imbalance in a multiclass context. We introduce a new approach, Feature Group Partitioning (FGP), in the data preprocessing phase, which effectively reduces the dimensionality of the features to a minimum. This study utilized synthetic oversampling techniques, specifically the Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic sampling (ADASYN), for class balancing. The dataset used in this research was collected from university students by administering the Burn Depression Checklist (BDC). For methodological comparison, we implemented heterogeneous ensemble learning (stacking), homogeneous ensemble learning (bagging), and five distinct supervised machine learning algorithms. The issue of overfitting was mitigated by evaluating accuracy on the training, validation, and testing datasets. To assess the effectiveness of the prediction models, balanced accuracy, sensitivity, specificity, precision, and F1-score indices were used. Overall, the comprehensive analysis demonstrates the difference between the Conventional Depression Screening (CDS) and FGP approaches. In summary, the results show that the stacking classifier for FGP with the SMOTE approach yields the highest balanced accuracy, at 92.81%. The empirical evidence demonstrates that the FGP approach, when combined with SMOTE, is able to produce better performance in predicting the severity of depression. Most importantly, the optimization of training time achieved by the FGP approach across all classifiers is a significant outcome of this research.
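
    A minimal sketch of the class-balancing and stacking steps described above, using imbalanced-learn's SMOTE and scikit-learn's StackingClassifier on synthetic data; the FGP feature partitioning and the paper's exact base learners are not reproduced.

        from imblearn.over_sampling import SMOTE
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier, StackingClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import balanced_accuracy_score
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier

        # Four severity levels with a skewed class distribution.
        X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                                   n_classes=4, weights=[0.55, 0.25, 0.15, 0.05],
                                   random_state=42)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

        # Oversample only the training split so synthetic points never reach the test set.
        X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

        stack = StackingClassifier(
            estimators=[("rf", RandomForestClassifier(random_state=42)),
                        ("knn", KNeighborsClassifier())],
            final_estimator=LogisticRegression(max_iter=1000))
        stack.fit(X_bal, y_bal)
        print("balanced accuracy:", balanced_accuracy_score(y_te, stack.predict(X_te)))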

  • Article type: Journal Article
    The early diagnosis of brain tumors is critical in healthcare, owing to the potentially life-threatening repercussions that unstable growths within the brain can pose to individuals. Accurate and early diagnosis of brain tumors enables prompt medical intervention. In this context, we have established a new model called MTAP to enable a highly accurate diagnosis of brain tumors. The MTAP model addresses dataset class imbalance by utilizing the ADASYN method, employs a network pruning technique to reduce unnecessary weights and nodes in the neural network, and incorporates the Avg-TopK pooling method for enhanced feature extraction. The primary goal of our research is to enhance the accuracy of brain tumor type detection, a critical aspect of medical imaging and diagnostics. The MTAP model introduces a novel classification strategy for brain tumors, leveraging the strength of deep learning methods and novel model refinement techniques. Following comprehensive experimental studies and meticulous design, the MTAP model achieved a state-of-the-art accuracy of 99.69%. Our findings indicate that the use of deep learning and innovative model refinement techniques shows promise in facilitating the early detection of brain tumors. Analysis of the model's heat map revealed a notable focus on regions encompassing the parietal and temporal lobes.
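
    The Avg-TopK pooling idea can be illustrated with a small numpy sketch: each pooling window contributes the average of its K largest activations rather than a single maximum. The window size, K, and the 2-D layout below are illustrative choices, not the MTAP paper's exact definition.

        import numpy as np

        def avg_topk_pool2d(feature_map, window=2, k=2):
            """Pool a 2-D feature map by averaging the k largest activations per window."""
            h, w = feature_map.shape
            out = np.empty((h // window, w // window))
            for i in range(0, h - h % window, window):
                for j in range(0, w - w % window, window):
                    patch = feature_map[i:i + window, j:j + window].ravel()
                    out[i // window, j // window] = np.sort(patch)[-k:].mean()
            return out

        fmap = np.arange(16, dtype=float).reshape(4, 4)
        print(avg_topk_pool2d(fmap))   # with k=1 this reduces to ordinary max pooling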

  • Article type: Journal Article
    Alzheimer's disease is an incurable neurological disorder that leads to a gradual decline in cognitive abilities, but early detection can significantly mitigate symptoms. Automatic diagnosis of Alzheimer's disease is all the more important given the shortage of expert medical staff, because it reduces the burden on clinicians and improves diagnostic outcomes. A detailed analysis of specific brain tissues is required to accurately diagnose the disease via segmented magnetic resonance imaging (MRI). Several studies have used traditional machine-learning approaches to diagnose the disease from MRI, but manually extracted features are more complex, time-consuming, and require substantial involvement from expert medical staff, and the traditional approach does not provide an accurate diagnosis. Deep learning extracts features automatically and optimizes the training process. The MRI Alzheimer's disease dataset consists of four classes: mild demented (896 images), moderate demented (64 images), non-demented (3200 images), and very mild demented (2240 images). The dataset is highly imbalanced; therefore, we used the adaptive synthetic oversampling technique to address this issue, after which the dataset was balanced. An ensemble of VGG16 and EfficientNet was used to detect Alzheimer's disease on both the imbalanced and balanced datasets to validate the performance of the models. The proposed method combines the predictions of multiple models into an ensemble model that learns complex and nuanced patterns from the data. The inputs and outputs of both models were connected to form an ensemble model, which was then extended with additional layers to make a more robust model. In this study, we propose an ensemble of EfficientNet-B2 and VGG-16 to diagnose the disease at an early stage with the highest accuracy. Experiments were performed on two publicly available datasets. The experimental results showed that the proposed method achieved 97.35% accuracy and 99.64% AUC on the multiclass dataset and 97.09% accuracy and 99.59% AUC on the binary-class dataset. The proposed method proved highly efficient and provided superior performance on both datasets compared to previous methods.
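
    A hypothetical Keras sketch of the fusion idea: pooled feature maps from VGG16 and EfficientNet-B2 backbones are concatenated and passed through additional dense layers. Input size, head layers, and the four-class output are assumptions; the authors' exact architecture and training setup are not reproduced.

        from tensorflow.keras import Model, layers
        from tensorflow.keras.applications import VGG16, EfficientNetB2

        inputs = layers.Input(shape=(224, 224, 3))
        # weights=None keeps the sketch self-contained; use weights="imagenet" for transfer learning.
        vgg = VGG16(include_top=False, weights=None, input_tensor=inputs)
        eff = EfficientNetB2(include_top=False, weights=None, input_tensor=inputs)

        # Pool each backbone's feature maps and concatenate them into one vector.
        merged = layers.Concatenate()([
            layers.GlobalAveragePooling2D()(vgg.output),
            layers.GlobalAveragePooling2D()(eff.output),
        ])
        x = layers.Dense(256, activation="relu")(merged)
        x = layers.Dropout(0.3)(x)
        outputs = layers.Dense(4, activation="softmax")(x)   # four dementia classes

        model = Model(inputs, outputs)
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.summary()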

  • Article type: Journal Article
    When correcting for the "class imbalance" problem in medical data, the effects of resampling applied to classifier algorithms remain unclear. We examined the effect on performance over several combinations of classifiers and resampling ratios.
    Multiple classification algorithms were trained on 7 resampled datasets: no correction, random undersampling, 4 ratios of the Synthetic Minority Oversampling Technique (SMOTE), and random oversampling with the Adaptive Synthetic algorithm (ADASYN). Performance was evaluated by Area Under the Curve (AUC), precision, recall, Brier score, and calibration metrics. A case study on prediction modeling for 30-day unplanned readmissions in previously admitted urology patients is presented.
    For most algorithms, using resampled data showed a significant increase in AUC and precision, ranging from 0.74 (CI: 0.69-0.79) to 0.93 (CI: 0.92-0.94) and from 0.35 (CI: 0.12-0.58) to 0.86 (CI: 0.81-0.92), respectively. All classification algorithms showed significant increases in recall and significant decreases in Brier score, with distorted calibration overestimating positives.
    Imbalance correction resulted in overall improved performance yet poorly calibrated models. Such models can still have clinical utility due to their strong discriminating performance, specifically when predicting only low- and high-risk cases is clinically more relevant.
    Resampling the data increased the performance of the classification algorithms yet produced an overestimation of positive predictions. Based on the findings from our case study, a thoughtful predefinition of the clinical prediction task may guide the use of resampling techniques in future studies aiming to improve clinical decision support tools.
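
    The study design above maps directly onto imbalanced-learn samplers. The sketch below compares no correction, random undersampling, several SMOTE ratios, and ADASYN on one classifier, reporting AUC, precision, recall, and Brier score; the dataset, ratios, and classifier are illustrative, not those of the study.

        from imblearn.over_sampling import ADASYN, SMOTE
        from imblearn.under_sampling import RandomUnderSampler
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import brier_score_loss, precision_score, recall_score, roc_auc_score
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        samplers = {"none": None,
                    "undersample": RandomUnderSampler(random_state=0),
                    "SMOTE 0.25": SMOTE(sampling_strategy=0.25, random_state=0),
                    "SMOTE 0.50": SMOTE(sampling_strategy=0.50, random_state=0),
                    "SMOTE 1.00": SMOTE(sampling_strategy=1.00, random_state=0),
                    "ADASYN": ADASYN(random_state=0)}

        for name, sampler in samplers.items():
            Xr, yr = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
            clf = LogisticRegression(max_iter=1000).fit(Xr, yr)
            proba = clf.predict_proba(X_te)[:, 1]
            pred = clf.predict(X_te)
            print(f"{name:12s} AUC={roc_auc_score(y_te, proba):.3f} "
                  f"P={precision_score(y_te, pred, zero_division=0):.3f} "
                  f"R={recall_score(y_te, pred):.3f} "
                  f"Brier={brier_score_loss(y_te, proba):.3f}")

    Note that the Brier score is computed on the untouched test set; that is what exposes the overestimation of positives reported above.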

  • Article type: Journal Article
    Alzheimer's Disease (AD) is a neurological brain disorder that causes dementia and neurological dysfunction, affecting memory, behavior, and cognition. Deep Learning (DL), a branch of Artificial Intelligence (AI), has paved the way for new AD detection and automation methods. A DL model's prediction accuracy depends on the size of the dataset, and DL models lose accuracy when the dataset has a class imbalance problem. This study aims to use a deep Convolutional Neural Network (CNN) to develop a reliable and efficient method for identifying Alzheimer's disease from MRI. We offer a new CNN architecture for diagnosing Alzheimer's disease with a modest number of parameters, making it well suited to training on smaller datasets. The proposed model correctly separates the early stages of Alzheimer's disease and displays class activation patterns on the brain as a heat map. The proposed Detection of Alzheimer's Disease Network (DAD-Net) is developed from scratch to correctly classify the phases of Alzheimer's disease while reducing parameters and computation costs. The Kaggle MRI image dataset has a severe class imbalance problem; therefore, we used a synthetic oversampling technique to redistribute images across the classes and avoid this problem. Precision, recall, F1-score, Area Under the Curve (AUC), and loss are used to compare the proposed DAD-Net against DEMENET and a CNN model. For accuracy, AUC, F1-score, precision, and recall, DAD-Net achieved 99.22%, 99.91%, 99.19%, 99.30%, and 99.14%, respectively. According to the simulation results, DAD-Net outperforms other state-of-the-art models on all evaluation metrics.
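
    Because SMOTE and ADASYN operate on flat feature vectors, image datasets are typically flattened before oversampling and reshaped afterwards. The sketch below shows this pattern on synthetic data; image size and class counts are illustrative, not those of the Kaggle MRI dataset.

        import numpy as np
        from imblearn.over_sampling import SMOTE   # ADASYN from imblearn works the same way

        rng = np.random.default_rng(0)
        h, w = 64, 64
        X_img = rng.random((300, h, w), dtype=np.float32)      # grayscale slices
        y = np.repeat([0, 1, 2, 3], [180, 80, 30, 10])         # imbalanced classes

        X_flat = X_img.reshape(len(X_img), -1)                 # (n, h*w) for the sampler
        X_bal, y_bal = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_flat, y)
        X_bal = X_bal.reshape(-1, h, w, 1)                     # back to image tensors for a CNN

        print(np.bincount(y), "->", np.bincount(y_bal))        # every class now matches the majority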

  • Article type: Journal Article
    A stable predictive model is essential for forecasting the chances of cesarean or C-section (CS) delivery, as unnecessary CS delivery can adversely affect neonatal, maternal, and pediatric morbidity and mortality and can incur significant financial burdens. Few state-of-the-art machine learning models have been applied in this area in recent years, and the current models are insufficient to correctly predict the probability of CS delivery. To alleviate this drawback, we propose a Henry gas solubility optimization (HGSO)-based random forest (RF) with an improved objective function, called HGSORF, for the classification of CS and non-CS classes. Real-world CS datasets can be noisy, such as the Pakistan Demographic and Health Survey (PDHS) dataset used in this study. HGSO can provide fine-tuned hyperparameters for the RF by avoiding local minima. For performance comparison, Gaussian Naive Bayes (GNB), linear discriminant analysis (LDA), K-nearest neighbors (KNN), a gradient boosting classifier (GBC), and logistic regression (LR) were also considered in this research. The ADAptive SYNthetic (ADASYN) algorithm was used to balance the dataset, and the proposed HGSORF was compared with the other classifiers as well as with other studies. HGSORF achieved superior performance, with an accuracy of 98.33% on the PDHS dataset. The hyperparameters of the RF were also optimized using commonly used hyperparameter-optimization algorithms, and the proposed HGSORF provided comparatively better performance. Additionally, to analyze the causes of CS and their significance, HGSORF is explained locally and globally using eXplainable artificial intelligence (XAI) tools such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME). A decision support system has been developed as a potential application to support clinical staff. All pre-trained models and relevant code are available at: https://github.com/MIrazul29/HGSORF_CSection.
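
    A minimal sketch of the ADASYN balancing and SHAP explanation steps, with a default random forest standing in for the HGSO-tuned model (the HGSO search itself is not reproduced here):

        import numpy as np
        import shap
        from imblearn.over_sampling import ADASYN
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2000, n_features=15, weights=[0.8, 0.2], random_state=1)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

        X_bal, y_bal = ADASYN(random_state=1).fit_resample(X_tr, y_tr)
        rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_bal, y_bal)
        print("test accuracy:", rf.score(X_te, y_te))

        # Global explanation: TreeExplainer is exact for tree ensembles;
        # shap.summary_plot(shap_values, X_te) would visualize per-feature importance.
        explainer = shap.TreeExplainer(rf)
        shap_values = explainer.shap_values(X_te)
        print("mean |SHAP| value:", float(np.abs(np.asarray(shap_values)).mean()))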

  • Article type: Journal Article
    The whole world faces a pandemic situation due to the deadly virus known as COVID-19. It takes considerable time for the virus to become mature enough to be traced, and during this time it may be transmitted to other people. To get out of this unexpected situation, quick identification of COVID-19 patients is required. We have designed and optimized a machine learning-based framework using inpatient facility data that provides a user-friendly, cost-effective, and time-efficient solution to this pandemic. The proposed framework uses Bayesian optimization to tune the hyperparameters of the classifier and the ADAptive SYNthetic (ADASYN) algorithm to balance the COVID and non-COVID classes of the dataset. Although the proposed technique has been applied to nine state-of-the-art classifiers to show its efficacy, it can be used with many classifiers and classification problems. It is evident from this study that eXtreme Gradient Boosting (XGB) provides the highest Kappa index, at 97.00%. Compared to the same pipeline without ADASYN, our proposed approach yields an improvement in the kappa index of 96.94%. Bayesian optimization has also been compared with grid search and random search to show its efficiency. Furthermore, the most dominant features have been identified using SHapley Additive exPlanations (SHAP) analysis. A comparison has also been made with other related works. The proposed method is capable of tracing COVID patients in less time than conventional techniques. Finally, two potential applications, namely a clinically operable decision tree and a decision support system, have been demonstrated to support clinical staff and build a recommender system.
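
    A sketch of the ADASYN-plus-Bayesian-optimization pipeline, assuming scikit-optimize's BayesSearchCV and XGBoost (the abstract does not name a specific optimization library); the search space, iteration budget, and data are illustrative.

        from imblearn.over_sampling import ADASYN
        from skopt import BayesSearchCV
        from skopt.space import Integer, Real
        from sklearn.datasets import make_classification
        from sklearn.metrics import cohen_kappa_score
        from sklearn.model_selection import train_test_split
        from xgboost import XGBClassifier

        X, y = make_classification(n_samples=3000, weights=[0.85, 0.15], random_state=7)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)
        X_bal, y_bal = ADASYN(random_state=7).fit_resample(X_tr, y_tr)

        # Bayesian optimization over a small XGBoost hyperparameter space.
        search = BayesSearchCV(
            XGBClassifier(eval_metric="logloss"),
            {"n_estimators": Integer(100, 500),
             "max_depth": Integer(2, 8),
             "learning_rate": Real(0.01, 0.3, prior="log-uniform")},
            n_iter=20, cv=3, random_state=7)
        search.fit(X_bal, y_bal)

        print("best params:", search.best_params_)
        print("kappa on held-out data:", cohen_kappa_score(y_te, search.predict(X_te)))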

  • Article type: Journal Article
    Many countries have attempted to monitor and predict harmful algal blooms to mitigate related problems and establish management practices. The current alert system, based on sampling of cell density, is used to indicate the bloom status and to inform a rapid and adequate response from water-associated organizations. The objective of this study was to develop an early warning system for cyanobacterial blooms to allow for efficient decision making prior to the occurrence of algal blooms and to guide preemptive management actions. In this study, two machine learning models, an artificial neural network (ANN) and a support vector machine (SVM), were constructed for the timely prediction of algal bloom alert levels using eight years' worth of meteorological, hydrodynamic, and water quality data from a reservoir where harmful cyanobacterial blooms frequently occur during summer. However, the imbalanced proportions of the alert-level classes in the output variable lead to biased training of the data-driven model and degrade its prediction performance. Therefore, synthetic data generated by the adaptive synthetic (ADASYN) sampling method were used to resolve the imbalance of minority-class data in the original data and to improve the prediction performance of the models. The results showed that the overall prediction performance for the caution level (L1) and warning level (L2) in models constructed using a combination of original and synthetic data was higher than in models constructed using original data only. In particular, the optimal ANN and SVM constructed using a combination of original and synthetic data during both training (including validation) and testing yielded distinctly improved recall and precision for L1, a critical alert level because it indicates a transition from normalcy to bloom formation. In addition, both optimal models constructed using synthetic-added data exhibited improvements in recall and precision of more than 33.7% when predicting L1 and L2 during testing. Therefore, the application of synthetic data can improve the detection performance of machine learning models by solving the imbalance of observed data, and reliable predictions by the improved models can aid the design of management practices to mitigate algal blooms within a reservoir.
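
    A minimal sketch of the balancing-then-classification step described above, with ADASYN applied to synthetic three-level alert data before training SVM and ANN models; per-class recall and precision are then read from the classification report. Data, class proportions, and model settings are illustrative.

        from imblearn.over_sampling import ADASYN
        from sklearn.datasets import make_classification
        from sklearn.metrics import classification_report
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        # Three alert levels (e.g., normal, caution L1, warning L2) with few L1/L2 samples.
        X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                                   weights=[0.85, 0.10, 0.05], random_state=3)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)
        X_bal, y_bal = ADASYN(random_state=3).fit_resample(X_tr, y_tr)

        for name, model in [("SVM", SVC()), ("ANN", MLPClassifier(max_iter=1000))]:
            clf = make_pipeline(StandardScaler(), model).fit(X_bal, y_bal)
            print(name)
            print(classification_report(y_te, clf.predict(X_te), digits=3))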

  • Article type: Journal Article
    Challenges posed by imbalanced data are encountered in many real-world applications. One possible approach to improving classifier performance on imbalanced data is oversampling. In this paper, we propose a new selective oversampling approach (SOA) that first isolates the most representative samples from the minority classes using an outlier detection technique and then utilizes these samples for synthetic oversampling. We show that the proposed approach improves the performance of two state-of-the-art oversampling methods, namely the synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN). The prediction performance is evaluated on four synthetic datasets and four real-world datasets, and the proposed SOA method always achieved the same or better performance than the other oversampling methods considered.
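
    One plausible reading of the selective step, sketched below: minority-class outliers are filtered out with IsolationForest before SMOTE generates synthetic samples from the retained, more representative points. The paper's exact SOA procedure may differ; the outlier detector, its contamination rate, and the data are assumptions.

        import numpy as np
        from imblearn.over_sampling import SMOTE
        from sklearn.datasets import make_classification
        from sklearn.ensemble import IsolationForest

        # flip_y injects label noise so some minority points behave like outliers.
        X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], flip_y=0.05, random_state=5)

        minority = X[y == 1]
        keep = IsolationForest(contamination=0.1, random_state=5).fit_predict(minority) == 1
        X_sel = np.vstack([X[y == 0], minority[keep]])     # majority + representative minority
        y_sel = np.concatenate([np.zeros((y == 0).sum(), dtype=int),
                                np.ones(keep.sum(), dtype=int)])

        X_bal, y_bal = SMOTE(random_state=5).fit_resample(X_sel, y_sel)
        print("minority kept:", int(keep.sum()), "of", len(minority),
              "| balanced counts:", np.bincount(y_bal))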
