Synthetic minority oversampling technique

  • Article type: Journal Article
    The interpretability of gait analysis studies in people with rare diseases, such as those with primary hereditary cerebellar ataxia (pwCA), is frequently limited by small sample sizes and unbalanced datasets. The purpose of this study was to assess the effectiveness of data balancing and generative artificial intelligence (AI) algorithms in generating synthetic data reflecting the actual gait abnormalities of pwCA. Gait data of 30 pwCA (age: 51.6 ± 12.2 years; 13 females, 17 males) and 100 healthy subjects (age: 57.1 ± 10.4 years; 60 females, 40 males) were collected at the lumbar level with an inertial measurement unit. Subsampling, oversampling, synthetic minority oversampling, generative adversarial networks, and conditional tabular generative adversarial networks (ctGAN) were applied to generate datasets to be input to a random forest classifier. Consistency and explainability metrics were also calculated to assess the coherence of the generated datasets with the known gait abnormalities of pwCA. ctGAN significantly improved classification performance compared with the original dataset and traditional data augmentation methods. ctGANs are an effective method for balancing tabular datasets from populations with rare diseases, owing to their ability to improve diagnostic models while keeping explainability consistent.
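    SMOTE is the common baseline technique across the entries below. As a minimal sketch of its core step (ours, in pure Python; not the ctGAN pipeline this study favors, and with illustrative names such as `smote`, `n_new`, and `k`), each synthetic sample interpolates between a minority instance and one of its k nearest minority-class neighbors:

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between a random minority sample and one of its k nearest minority
    neighbors (the core idea of SMOTE, simplified)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbors of x, excluding x itself
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=8, k=2)
print(len(new_points))  # 8 synthetic samples inside the minority region
```

    Because every synthetic point is a convex combination of two real minority samples, it always lies within the coordinate-wise range of the minority class, which is both SMOTE's strength and, for complex distributions, its limitation relative to generative models such as ctGAN.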

  • Article type: Journal Article
    BACKGROUND: Postoperative delirium, particularly prevalent in elderly patients after abdominal cancer surgery, presents significant challenges in clinical management.
    OBJECTIVE: To develop a synthetic minority oversampling technique (SMOTE)-based model for predicting postoperative delirium in elderly abdominal cancer patients.
    METHODS: In this retrospective cohort study, we analyzed data from 611 elderly patients who underwent abdominal malignant tumor surgery at our hospital between September 2020 and October 2022. The incidence of postoperative delirium was recorded for 7 d post-surgery. Patients were divided into delirium and non-delirium groups according to whether postoperative delirium occurred. A multivariate logistic regression model was used to identify risk factors and develop a predictive model for postoperative delirium. The SMOTE technique was applied to enhance the model by oversampling the delirium cases. The model's predictive accuracy was then validated.
    RESULTS: In our study involving 611 elderly patients with abdominal malignant tumors, multivariate logistic regression analysis identified significant risk factors for postoperative delirium. These included the Charlson comorbidity index, American Society of Anesthesiologists classification, history of cerebrovascular disease, surgical duration, perioperative blood transfusion, and postoperative pain score. The incidence rate of postoperative delirium in our study was 22.91%. The original predictive model (P1) exhibited an area under the receiver operating characteristic curve of 0.862. In comparison, the SMOTE-based logistic early warning model (P2), which utilized the SMOTE oversampling algorithm, showed a slightly lower but comparable area under the curve of 0.856, suggesting no significant difference in performance between the two predictive approaches.
    CONCLUSIONS: This study confirms that the SMOTE-enhanced predictive model for postoperative delirium in elderly abdominal tumor patients shows performance equivalent to that of traditional methods, effectively addressing data imbalance.
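    The P1-versus-P2 comparison above rests on the area under the ROC curve. A minimal sketch of how that statistic can be computed via the rank (Mann-Whitney) formulation (our own illustration with toy numbers, not the study's code or data):

```python
def auc(probs, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that a randomly chosen event receives a higher
    predicted probability than a randomly chosen non-event."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    # count pairwise "wins"; ties count as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

probs = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(auc(probs, labels))
```

    Because AUC depends only on the ranking of predicted probabilities, oversampling steps like SMOTE that roughly preserve ranking tend to leave it nearly unchanged, consistent with the 0.862 versus 0.856 result reported here.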

  • Article type: Journal Article
    Heart disease is a leading cause of death worldwide and a major public health problem for a large number of individuals. A major issue raised by routine clinical data analysis is the recognition of cardiovascular illnesses, including heart attacks and coronary artery disease, even though early identification of heart disease can save many lives. Accurate forecasting and decision assistance can be achieved effectively with machine learning (ML). Big data, the vast amounts of data generated by the health sector, may assist models used to make diagnostic choices by revealing hidden information or intricate patterns. This paper describes a big data analysis and visualization approach for heart disease detection using a hybrid deep learning algorithm. The proposed approach is intended for use with big data systems, such as Apache Hadoop. An extensive medical data collection is first subjected to an improved k-means clustering (IKC) method to remove outliers, and the remaining class distribution is then balanced using the synthetic minority over-sampling technique (SMOTE). The disease is then forecast using a bio-inspired hybrid mutation-based swarm intelligence (HMSI) algorithm with an attention-based gated recurrent unit network (AttGRU) model, after recursive feature elimination (RFE) has determined which features are most important. In our implementation, we compare four machine learning algorithms: SAE + ANN (sparse autoencoder + artificial neural network), LR (logistic regression), KNN (K-nearest neighbour), and naïve Bayes. The experimental results indicate that the proposed hybrid model attains a heart disease prediction accuracy of 95.42%, effectively outperforming the related work and addressing the identified research gap.

  • Article type: Journal Article
    This research leverages a novel deep learning model, Inception-v3, to predict pedestrian crash severity using data collected over five years (2016-2021) from Louisiana. The final dataset incorporates forty different variables related to pedestrian attributes, environmental conditions, and vehicular specifics. Crash severity was classified into three categories: fatal, injury, and no injury. The Boruta algorithm was applied to determine the importance of variables and investigate contributing factors to pedestrian crash severity, revealing several associated aspects, including pedestrian gender, pedestrian and driver impairment, posted speed limits, alcohol involvement, pedestrian age, visibility obstruction, roadway lighting conditions, and both pedestrian and driver conditions, including distraction and inattentiveness. To address data imbalance, the study employed Random Under Sampling (RUS) and the Synthetic Minority Oversampling Technique (SMOTE). The DeepInsight technique transformed numeric data into images. Subsequently, five crash severity prediction models were developed with Inception-v3, considering various scenarios: original, under-sampled, over-sampled, a combination of under- and over-sampled data, and the top twenty-five important variables. Results indicated that the model applying both over- and under-sampling outperforms models based on other data balancing techniques on several performance metrics, including accuracy, sensitivity, precision, specificity, false negative rate (FNR), false positive rate (FPR), and F1-score. This model achieved prediction accuracies of 93.5%, 77.5%, and 85.9% for the fatal, injury, and no injury categories, respectively. Additionally, comparative analysis based on several performance metrics and McNemar's tests demonstrated that the predictive performance of the Inception-v3 deep learning model is statistically superior to traditional machine learning and statistical models. The insights from this research can be effectively harnessed by safety professionals, emergency service providers, traffic management centers, and vehicle manufacturers to enhance their safety measures and applications.
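    The best-performing scenario combines under- and over-sampling. That combination can be sketched with plain random resampling (our simplification using invented names `balance_classes` and `target`; the study pairs RUS with SMOTE on DeepInsight-transformed images, which is omitted here):

```python
import random
from collections import defaultdict

def balance_classes(samples, labels, target=None, seed=0):
    """Random under-sampling of majority classes and random over-sampling
    (with replacement) of minority classes to a common per-class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    sizes = [len(v) for v in by_class.values()]
    if target is None:
        target = (min(sizes) + max(sizes)) // 2  # meet in the middle
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            chosen = rng.sample(xs, target)  # under-sample majority
        else:
            chosen = [rng.choice(xs) for _ in range(target)]  # over-sample minority
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y

labels = ["fatal"] * 5 + ["injury"] * 20 + ["none"] * 75
samples = list(range(100))
bx, by = balance_classes(samples, labels)
print(sorted(set(by.count(c) for c in set(by))))  # all classes equal in size
```

    Meeting in the middle limits both the information loss of aggressive under-sampling and the duplication of aggressive over-sampling, which is one plausible reason the combined scenario performed best.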

  • Article type: Journal Article
    Time-consuming data labeling in brain-computer interfaces (BCIs) raises many problems, such as mental fatigue, and is one key factor hindering the real-world adoption of motor imagery (MI)-based BCIs. An alternative approach is to integrate readily available, as well as informative, unlabeled data online, but this approach has been less investigated.
    We proposed an online semi-supervised learning scheme to improve the classification performance of MI-based BCI. This scheme uses regularized weighted online sequential extreme learning machine (RWOS-ELM) as the base classifier and updates its model parameters with incoming balanced data chunk-by-chunk. In the initial stage, we designed a technique that combines the synthetic minority oversampling with the edited nearest neighbor rule for data augmentation to construct more discriminative initial classifiers. When used online, the incoming chunk of data is first pseudo-labeled by RWOS-ELM as well as an auxiliary classifier, and then balanced again by the above-mentioned technique. Initial classifiers are further updated based on these class-balanced data.
    Offline experimental results on two publicly available MI datasets demonstrate the superiority of the proposed scheme over its counterparts. Further online experiments on six subjects show that their BCI performance gradually improved by learning from incoming unlabeled data.
    Our proposed online semi-supervised learning scheme has higher computation and memory usage efficiency, which is promising for online MI-based BCIs, especially in the case of insufficient labeled training data.
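    The pseudo-label-then-update loop at the heart of this scheme can be sketched with a toy nearest-centroid learner standing in for RWOS-ELM (our simplification; the class-rebalancing and auxiliary-classifier steps are omitted, and `centroid_update` is an illustrative name):

```python
import math

def centroid_update(centroids, chunk, weight=0.1):
    """Pseudo-label each incoming unlabeled sample with its nearest class
    centroid, then nudge that centroid toward the sample (online update)."""
    labels = []
    for x in chunk:
        y = min(centroids, key=lambda c: math.dist(centroids[c], x))
        labels.append(y)
        # exponential moving-average update of the winning centroid
        centroids[y] = tuple(
            (1 - weight) * c + weight * v for c, v in zip(centroids[y], x)
        )
    return labels

centroids = {"left": (0.0, 0.0), "right": (4.0, 4.0)}
chunk = [(0.5, 0.2), (3.8, 4.1), (0.1, 0.4)]
print(centroid_update(centroids, chunk))  # pseudo-labels for the chunk
```

    The same two-step pattern (predict a pseudo-label, then fold the sample back into the model) is what lets the BCI improve chunk-by-chunk without any new labeled data.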

  • Article type: Journal Article
    This work provides an innovative endodontic instrument fault detection methodology for root canal treatment (RCT). Sometimes, an endodontic instrument is prone to fracture from the tip, for reasons beyond the dentist's control. A comprehensive assessment and decision support system for an endodontist could avoid several breakages. This research proposes a machine learning and artificial intelligence-based approach that can help to diagnose instrument health. During RCT, force signals are recorded using a dynamometer, and statistical features are extracted from the acquired signals. Because there are fewer instances of the minority class (i.e., the faulty/moderate class), oversampling of the dataset is required to avoid bias and overfitting; therefore, the synthetic minority oversampling technique (SMOTE) is employed to increase the minority class. Performance is then evaluated using machine learning techniques, namely Gaussian naïve Bayes (GNB), quadratic support vector machine (QSVM), fine k-nearest neighbor (FKNN), and ensemble bagged tree (EBT). The EBT model provides excellent performance relative to GNB, QSVM, and FKNN. Machine learning (ML) algorithms can accurately detect endodontic instrument faults by monitoring the force signals. The EBT and FKNN classifiers are trained exceptionally well, with area-under-curve values of 1.0 and 0.99 and prediction accuracies of 98.95% and 97.56%, respectively. ML can potentially enhance clinical outcomes, boost learning, decrease process malfunctions, increase treatment efficacy, and enhance instrument performance, contributing to superior RCT processes. This work uses ML methodologies for fault detection of endodontic instruments, providing practitioners with an adequate decision support system.
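    The abstract does not list which statistical features were extracted from the force signals; a hedged sketch of the kind of features commonly used in such pipelines (the feature set and the name `signal_features` are our assumptions, not the paper's):

```python
import math

def signal_features(signal):
    """Basic statistical features of one force-signal window."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((v - mean) ** 2 for v in signal) / n  # population variance
    rms = math.sqrt(sum(v * v for v in signal) / n)
    return {
        "mean": mean,
        "std": math.sqrt(var),
        "rms": rms,
        "peak": max(abs(v) for v in signal),
    }

feats = signal_features([0.0, 1.0, -1.0, 2.0])
print(feats["rms"])  # root-mean-square of the window
```

    Each windowed recording is reduced to one such feature vector, which is what SMOTE then oversamples and the GNB/QSVM/FKNN/EBT classifiers consume.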

  • Article type: Journal Article
    Real-time polymerase chain reaction (RT-PCR), known as the swab test, is a diagnostic test that can diagnose COVID-19 disease through respiratory samples in the laboratory. Due to the rapid spread of the coronavirus around the world, the RT-PCR test has become insufficient for obtaining fast results. For this reason, the need for diagnostic methods to fill this gap has arisen, and machine learning studies have started in this area. On the other hand, studying medical data is a challenging area because the data it contains is inconsistent, incomplete, difficult to scale, and very large. Additionally, some poor clinical decisions, irrelevant parameters, and limited medical data adversely affect the accuracy of the studies performed. Therefore, considering the availability of datasets containing COVID-19 blood parameters, which are fewer in number than other medical datasets today, this study aims to improve these existing datasets. To obtain more consistent results in COVID-19 machine learning studies, the effect of data preprocessing techniques on the classification of COVID-19 data was investigated. First, categorical-feature encoding and feature scaling were applied to a dataset with 15 features containing blood data of 279 patients, including gender and age information. Then, missing values were imputed using both the K-nearest neighbor (KNN) algorithm and multiple imputation by chained equations (MICE). Data balancing was performed with the synthetic minority oversampling technique (SMOTE). The effect of the data preprocessing techniques on the ensemble learning algorithms bagging, AdaBoost, and random forest, and on the popular classifiers KNN, support vector machine, logistic regression, artificial neural network, and decision tree, was analyzed. The highest accuracies obtained with the bagging classifier were 83.42% and 83.74% with the KNN and MICE imputations, respectively, when SMOTE was applied. On the other hand, the highest accuracy reached with the same classifier without SMOTE was 83.91%, for the KNN imputation. In conclusion, certain data preprocessing techniques were examined comparatively, their effect on success was presented, and the importance of the right combination of preprocessing steps was demonstrated experimentally.
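    The KNN imputation step mentioned above can be sketched in pure Python (our simplification, assuming numeric features with `None` for missing entries; `knn_impute` is an illustrative name, not the study's code):

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest
    complete rows, measuring distance only on jointly observed features."""
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        obs = [i for i, v in enumerate(r) if v is not None]
        neighbors = sorted(
            complete,
            key=lambda c: math.dist([r[i] for i in obs], [c[i] for i in obs]),
        )[:k]
        filled.append([
            v if v is not None else sum(n[i] for n in neighbors) / k
            for i, v in enumerate(r)
        ])
    return filled

data = [[1.0, 2.0], [1.1, 2.2], [5.0, 6.0], [1.05, None]]
print(knn_impute(data, k=2)[-1])  # missing value taken from the 2 closest rows
```

    MICE, the other imputation method compared here, instead iteratively regresses each incomplete feature on the others; the study found the two gave similar downstream accuracy with the bagging classifier.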

  • Article type: Journal Article
    Deep learning-based fault diagnosis usually requires a rich supply of data, but fault samples are scarce in practice, posing a considerable challenge for existing diagnosis approaches to achieve highly accurate fault detection in real applications. This paper proposes an imbalanced fault diagnosis method for rotating machinery that combines time-frequency feature oversampling (TFFO) with a convolutional neural network (CNN). First, a sliding segmentation sampling method is employed to increase the number of fault samples in the form of one-dimensional signals. The signals are then converted into two-dimensional time-frequency feature maps by the continuous wavelet transform (CWT). Subsequently, the minority samples are expanded again using the synthetic minority oversampling technique (SMOTE) to realize TFFO. After this two-fold data expansion, a balanced dataset is obtained and imported into an improved 2dCNN based on LeNet-5 to perform fault diagnosis. To verify the proposed method, two experiments involving single and compound faults were conducted on locomotive wheel-set bearings and a gearbox, yielding several datasets with different degrees of imbalance and various signal-to-noise ratios. The results demonstrate the advantages of the proposed method in terms of classification accuracy, stability, and noise robustness in imbalanced fault diagnosis, with a fault classification accuracy of over 97%.
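    The sliding segmentation step, the first stage of the two-fold expansion, can be sketched as follows (our simplification; the CWT and SMOTE stages are omitted, and `sliding_segments` is an illustrative name):

```python
def sliding_segments(signal, width, step):
    """Split a 1-D signal into overlapping fixed-width windows; an overlap
    of (width - step) samples multiplies the number of fault samples
    available for training."""
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, step)]

signal = list(range(10))
segs = sliding_segments(signal, width=4, step=2)
print(len(segs))  # 4 windows: [0..3], [2..5], [4..7], [6..9]
```

    Each window would then be passed through the CWT to produce one time-frequency image, and SMOTE would interpolate additional minority-class images from those.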

  • Article type: Journal Article
    Methods to correct class imbalance (imbalance between the frequency of outcome events and nonevents) are receiving increasing interest for developing prediction models. We examined the effect of imbalance correction on the performance of logistic regression models.
    Prediction models were developed using standard and penalized (ridge) logistic regression under 4 methods to address class imbalance: no correction, random undersampling, random oversampling, and SMOTE. Model performance was evaluated in terms of discrimination, calibration, and classification. Using Monte Carlo simulations, we studied the impact of training set size, number of predictors, and the outcome event fraction. A case study on prediction modeling for ovarian cancer diagnosis is presented.
    The use of random undersampling, random oversampling, or SMOTE yielded poorly calibrated models: the probability to belong to the minority class was strongly overestimated. These methods did not result in higher areas under the ROC curve when compared with models developed without correction for class imbalance. Although imbalance correction improved the balance between sensitivity and specificity, similar results were obtained by shifting the probability threshold instead.
    Imbalance correction led to models with strong miscalibration without better ability to distinguish between patients with and without the outcome event. The inaccurate probability estimates reduce the clinical utility of the model, because decisions about treatment are ill-informed.
    Outcome imbalance is not a problem in itself, and imbalance correction may even worsen model performance.
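    The threshold-shifting alternative this study recommends can be illustrated with a toy example (our own numbers, not the study's simulation; `confusion` is an illustrative name): lowering the classification threshold trades specificity for sensitivity without touching the training data or the calibrated probabilities.

```python
def confusion(probs, labels, threshold):
    """Sensitivity and specificity of thresholded probability predictions."""
    pairs = list(zip(probs, labels))
    tp = sum(p >= threshold and y == 1 for p, y in pairs)
    fn = sum(p < threshold and y == 1 for p, y in pairs)
    tn = sum(p < threshold and y == 0 for p, y in pairs)
    fp = sum(p >= threshold and y == 0 for p, y in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Toy calibrated scores on an imbalanced outcome.
probs = [0.1, 0.2, 0.3, 0.3, 0.4, 0.4, 0.6, 0.7]
labels = [0, 0, 0, 1, 0, 1, 0, 1]

# The default 0.5 threshold favors specificity; lowering it raises
# sensitivity at the cost of specificity -- no resampling required.
print(confusion(probs, labels, 0.5))
print(confusion(probs, labels, 0.3))
```

    Resampling methods such as SMOTE achieve a similar sensitivity/specificity shift, but they do so by distorting the estimated probabilities themselves, which is the miscalibration this study warns about.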

  • Article type: Journal Article
    The survival of mankind cannot be imagined without air. Consistent development in almost all realms of modern human society has adversely affected the health of the air. Daily industrial, transport, and domestic activities are stirring hazardous pollutants into our environment. Monitoring and predicting air quality have become essentially important in this era, especially in developing countries like India. In contrast to traditional methods, prediction technologies based on machine learning techniques have proved to be the most efficient tools to study such modern hazards. The present work investigates six years of air pollution data from 23 Indian cities for air quality analysis and prediction. The dataset is thoroughly preprocessed, and key features are selected through correlation analysis. An exploratory data analysis is carried out to develop insights into various hidden patterns in the dataset, and the pollutants directly affecting the air quality index are identified. A significant fall in almost all pollutants is observed in the pandemic year, 2020. The data imbalance problem is solved with a resampling technique, and five machine learning models are employed to predict air quality. The results of these models are compared against standard metrics. The Gaussian naive Bayes model achieves the highest accuracy, while the support vector machine model exhibits the lowest. The performance of these models is evaluated and compared through established performance parameters. The XGBoost model performed the best among them and achieved the highest linearity between the predicted and actual data.