Oversampling

  • Article type: Journal Article
    Driving daily through traffic congestion has been recognised as a major cause of stress. High levels of stress while driving impair the driver's decisions, which can lead to accidents and long-term health hazards. Accordingly, there is a great need to determine drivers' stress levels by measuring and predicting the major causes (features or classes) that increase stress. In this paper, the problem of predicting automobile drivers' stress levels, as experienced during actual driving, is investigated through the application of five data mining algorithms: K-Nearest Neighbour (KNN), Decision Tree (J48), Random Forest (RF), Support Vector Machine (SVM), and Artificial Neural Networks (ANN). An experiment was conducted on 14 drivers taking various routes in Amman, Jordan, with a wearable biomedical device attached to the driver to collect physiological data instantly. The collected data are grouped into two categories: 'Yes' to signify the presence of stress and 'No' to signify its absence. To apply data mining algorithms to the dataset efficiently, oversampling was used to keep the under-represented (minority) class from degrading stress prediction. The findings are evaluated in relation to stress prediction and compared, using the Friedman rank test, against standard reference approaches that do not use oversampling and/or feature selection. The proposed approach, in combination with RF, surpassed all others in terms of accuracy, AUC, specificity, and sensitivity, with rates of 98.92%, 99.91%, 98.46%, and 99.36%, respectively.
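The abstract above does not say which oversampling variant was used, so the following is only a minimal sketch, assuming plain random oversampling (duplicating minority-class rows until the classes are balanced). It also shows how accuracy, sensitivity, and specificity follow from confusion counts. The function names `random_oversample` and `binary_metrics` and the toy data are illustrative assumptions, not taken from the paper.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest one (assumed variant, not the paper's)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive="Yes"):
    """Accuracy, sensitivity, specificity; assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical toy data: four 'No' samples and one 'Yes' sample.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["No", "No", "No", "No", "Yes"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

In practice it is usually safer to oversample only the training split, so that duplicated rows cannot leak into the data used for evaluation.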

  • Article type: Journal Article
    Thalassemia is considered one of the most common genetic blood disorders and has received considerable attention in medical research worldwide. In this context, one of the greatest challenges for healthcare professionals is to correctly differentiate normal individuals from asymptomatic thalassemia carriers. Usually, thalassemia diagnosis is based on certain measurable characteristic changes in blood cell counts and related indices. These changes can be derived easily from a complete blood count (CBC) test performed with a fully automated blood analyzer or counter. However, the reliability of the CBC test alone is questionable, because the candidate characteristics can also be seen in other disorders, leading to misdiagnosis of thalassemia. Other costly and time-consuming tests must therefore be performed, and the resulting delay in correct diagnosis may have serious consequences. To help overcome these diagnostic challenges, this work presents a new dataset collected from the Palestine Avenir Foundation for persons tested for thalassemia. We aim to compile a gold standard dataset for thalassemia and make it available to researchers in this field. Moreover, we use this dataset to predict the specific type of thalassemia known as beta thalassemia (β-thalassemia) with a hybrid data mining model. The proposed model consists of two main steps. First, to overcome the highly imbalanced class distribution in the dataset, the balancing technique SMOTE is applied. Second, four classification models, namely k-nearest neighbors (k-NN), naïve Bayes (NB), decision tree (DT), and the multilayer perceptron (MLP) neural network, are used to differentiate between normal persons and patients carrying β-thalassemia. Different evaluation metrics are used to assess the performance of the proposed model. The experimental results show that the SMOTE oversampling method can effectively improve the identification rate of β-thalassemia carriers in a highly imbalanced class distribution. The results also reveal that the NB classifier achieved the best performance in differentiating between normal persons and β-thalassemia carriers at a SMOTE oversampling ratio of 400%. This combination shows a specificity of 99.47% and a sensitivity of 98.81%.
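To make the SMOTE step concrete, here is a minimal pure-Python sketch of the basic technique: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. It assumes that a 400% ratio means four synthetic samples per original minority sample. The neighbour count `k=5`, the function name `smote`, and the toy feature values are assumptions for illustration, since the abstract gives no such details.

```python
import math
import random


def smote(minority, percent=400, k=5, seed=0):
    """Generate synthetic minority samples by interpolating towards
    randomly chosen nearest minority neighbours (basic SMOTE sketch).
    Requires at least two minority samples."""
    rng = random.Random(seed)
    per_sample = percent // 100          # 400% -> 4 synthetic rows per sample
    k = min(k, len(minority) - 1)        # cannot have more neighbours than peers
    synthetic = []
    for i, x in enumerate(minority):
        # Indices of the other minority samples, nearest first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours = others[:k]
        for _ in range(per_sample):
            nb = minority[rng.choice(neighbours)]
            gap = rng.random()           # position along the segment x -> nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical two-feature minority class (not real CBC values).
carriers = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(carriers, percent=400, k=3)
```

The synthetic rows would then be appended to the training data before fitting a classifier such as naïve Bayes. Because every new point lies between two existing minority points, the method fills in the minority region rather than repeating identical rows.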

  • Article type: Journal Article
    A clustering problem involving multivariate time series (MTS) requires the selection of similarity metrics. This paper shows the limitations of the PCA similarity factor (SPCA) as a single metric in nonlinear problems where the same process variables differ in magnitude because of expected changes in operating conditions. A novel method for clustering MTS is proposed, based on a combination of SPCA and the average-based Euclidean distance (AED) within a fuzzy clustering approach. Case studies involving both simulated data and real industrial data collected from a large-scale gas turbine illustrate that the hybrid approach enhances the ability to recognize normal and faulty operating patterns. This paper also proposes an oversampling procedure for creating synthetic multivariate time series, which can be useful in the common situation of unbalanced data sets.
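The limitation described above can be illustrated with a small sketch. It assumes the usual definition of the PCA similarity factor, trace(LᵀMMᵀL)/k for the first k principal directions L and M of two mean-centred series, and it assumes that AED is the Euclidean distance between the per-variable averages of the two series. Both definitions, the function names, and the random data are assumptions, since the abstract does not spell them out. How the paper combines the two metrics inside fuzzy clustering is not reproduced here.

```python
import numpy as np


def principal_directions(X, k):
    """First k principal directions of X (rows = time steps, columns = variables)."""
    Xc = X - X.mean(axis=0)                       # mean-centre each variable
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k].T                               # shape: (n_variables, k)


def pca_similarity(X1, X2, k=2):
    """PCA similarity factor: 1 means identical principal subspaces, 0 orthogonal."""
    L = principal_directions(X1, k)
    M = principal_directions(X2, k)
    return float(np.trace(L.T @ M @ M.T @ L) / k)


def average_euclidean_distance(X1, X2):
    """Assumed AED: distance between the vectors of per-variable averages."""
    return float(np.linalg.norm(X1.mean(axis=0) - X2.mean(axis=0)))


# Two hypothetical series with identical shape but a shifted operating level.
rng = np.random.default_rng(0)
series_a = rng.normal(size=(50, 3))
series_b = series_a + 10.0                        # same dynamics, higher magnitude

spca = pca_similarity(series_a, series_b)
aed = average_euclidean_distance(series_a, series_b)
```

Here `spca` is essentially 1 because centring removes the offset, so SPCA alone treats the two series as the same pattern, while `aed` registers the shift in magnitude. This is one way to see why a magnitude-aware distance complements SPCA.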