Keywords: Classification; Data mining; Imbalance; Medical dataset; Oversampling; SMOTE; Thalassemia

MeSH: Asymptomatic Diseases; Bayes Theorem; Biomarkers / blood; Data Mining / methods; Databases, Factual; Decision Trees; Erythrocyte Indices; Genetic Carrier Screening / methods; Hemoglobins / analysis, genetics; Heterozygote; Humans; Middle East; Neural Networks, Computer; Phenotype; Predictive Value of Tests; Reproducibility of Results; beta-Thalassemia / blood, classification, diagnosis, genetics

Source: DOI:10.1016/j.artmed.2018.04.009

Abstract:
Thalassemia is one of the most common genetic blood disorders and has received extensive attention in medical research worldwide. In this context, one of the greatest challenges for healthcare professionals is to correctly differentiate normal individuals from asymptomatic thalassemia carriers. Thalassemia diagnosis is usually based on measurable characteristic changes in blood cell counts and related indices. These changes can be obtained easily from a complete blood count (CBC) test performed with a fully automated blood analyzer or counter. However, the reliability of the CBC test alone is questionable, because the candidate characteristics can also appear in other disorders, which leads to misdiagnosis of thalassemia. Additional costly and time-consuming tests must therefore be performed, and the resulting delay in reaching the correct diagnosis may have serious consequences. To help overcome these diagnostic challenges, this work presents a novel dataset collected from the Palestine Avenir Foundation for persons tested for thalassemia. We aim to compile a gold standard dataset for thalassemia and make it available to researchers in this field. Moreover, we use this dataset to predict a specific type of thalassemia, beta thalassemia (β-thalassemia), using a hybrid data mining model. The proposed model consists of two main steps. First, to overcome the highly imbalanced class distribution in the dataset, a balancing technique called SMOTE is applied. Second, four classification models, namely k-nearest neighbors (k-NN), naïve Bayes (NB), decision tree (DT), and multilayer perceptron (MLP) neural network, are used to differentiate between normal persons and β-thalassemia carriers. Different evaluation metrics are used to assess the performance of the proposed model.
The experimental results show that SMOTE oversampling can effectively improve the identification rate of β-thalassemia carriers under a highly imbalanced class distribution. The results also reveal that the NB classifier achieved the best performance in differentiating between normal individuals and β-thalassemia carriers at a SMOTE oversampling ratio of 400%. This combination achieved a specificity of 99.47% and a sensitivity of 98.81%.
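As an illustration of the oversampling step and the two reported metrics, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that a 400% ratio means four synthetic samples per original minority sample, as in the original SMOTE formulation. The function names, the neighbor count `k`, and the two-feature toy rows are hypothetical and are not taken from the dataset. The classifier itself (NB, k-NN, DT, or MLP) is omitted.

```python
import math
import random


def smote(minority, percent=400, k=5, seed=0):
    """Create synthetic minority rows by interpolating toward near neighbors.

    Assumption: percent=400 yields 4 synthetic rows per original row.
    """
    rng = random.Random(seed)
    per_sample = percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbors of x (excluding x), Euclidean distance
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbors)
            gap = rng.random()  # position on the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy rows standing in for two CBC indices of carrier samples.
carriers = [[62.0, 19.5], [64.1, 20.2], [60.3, 18.9], [66.0, 21.0]]
new_rows = smote(carriers, percent=400, k=3)
print(len(new_rows))  # 4 carriers x 4 = 16 synthetic rows

# Made-up labels and predictions, only to exercise the metric function.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
print(sens, spec)
```

In practice the synthetic rows would be added to the training split only, so that sensitivity and specificity are measured on unmodified test samples.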