Keywords: COVID-19; Imbalanced data; Machine learning; Oversampling; Smoothed bootstrap resampling

Source: DOI:10.1016/j.jksuci.2021.09.021

Abstract:
The Coronavirus Disease (COVID-19) was declared a pandemic by the World Health Organization (WHO), and the pandemic has not yet ended. As the COVID-19 infection rate increases, a computational approach is needed to predict which patients are infected, in order to shorten diagnosis time and reduce human error relative to conventional diagnosis. However, negative samples typically outnumber positive ones, which produces a class imbalance that degrades classification performance and biases model evaluation results. This study proposes a new oversampling technique, TRIM-SBR, to generate minority-class data for diagnosing patients infected with COVID-19. Developing an oversampling technique remains challenging because of the data generalization issue. The proposed method is based on pruning: it searches for specific minority regions while retaining data generalization, yielding minority data seeds that serve as benchmarks for creating new synthetic data with smoothed bootstrap resampling. Accuracy, Specificity, Sensitivity, F-measure, and AUC are used to evaluate classifier performance under data imbalance. The results show that the TRIM-SBR method provides the best performance compared with other oversampling techniques.
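
To make the resampling and evaluation steps concrete, the following is a minimal sketch of a generic smoothed bootstrap oversampler together with some of the threshold-based metrics named in the abstract. It is not the TRIM-SBR algorithm: the pruning (TRIM) step that selects the minority seeds is not specified in the abstract and is omitted here, so every minority row is treated as a seed. The function names, the `shrink` parameter, and the Silverman-style rule-of-thumb bandwidth are assumptions made for illustration. AUC is left out because it requires classifier scores rather than hard labels.

```python
import numpy as np


def smoothed_bootstrap_oversample(X_min, n_new, rng=None, shrink=1.0):
    """Generate n_new synthetic minority rows (hypothetical helper).

    Each synthetic row is a minority seed drawn with replacement and
    perturbed by Gaussian noise. Requires at least two minority rows.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape
    # Assumption: per-feature rule-of-thumb bandwidth for a Gaussian kernel.
    h = shrink * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4)) * X_min.std(axis=0, ddof=1)
    # Plain bootstrap: pick seed rows uniformly with replacement.
    seeds = X_min[rng.integers(0, n, size=n_new)]
    # Smoothing: jitter each seed so new rows are not exact duplicates.
    return seeds + rng.normal(size=(n_new, d)) * h


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and F-measure for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }
```

Setting `shrink=0` reduces the sampler to plain random oversampling (exact copies of minority rows), which makes the effect of the smoothing step easy to compare. On imbalanced data, accuracy alone can look high even when sensitivity is poor, which is one reason several metrics are reported together.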