Keywords: case-based reasoning, data mining, dimensionality reduction, feature weighting, gene expression, machine learning

Source: DOI:10.4137/CIN.S22371

Abstract:
BACKGROUND: Retrieving similar cases in a case-based reasoning system is a major challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets, and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurement requires machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process.
METHODS: This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles.
RESULTS: The proposed methodology is validated on several data sets: a childhood leukemia data set collected from The Children's Hospital at Westmead, as well as the Colon cancer, the National Cancer Institute (NCI), and the Prostate cancer data sets. In retrieving stored patients who are similar to new patients, the framework achieved the following accuracies: 96% on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set.
CONCLUSIONS: The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.
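
The retrieval step described under METHODS can be illustrated with a minimal sketch. It assumes a weighted Euclidean distance as the weighted-feature-based similarity and a simple majority vote over the k retrieved cases; the abstract does not say how the feature weights are derived or which distance is used, so the function names (`retrieve_similar_cases`, `predict_label`), the weights, and the toy expression values below are all hypothetical and not taken from the article.

```python
from collections import Counter

import numpy as np


def retrieve_similar_cases(case_base, query, weights, k=3):
    """Return indices of the k stored cases closest to the query, nearest first.

    Assumption: similarity is a weighted Euclidean distance, where each
    gene's squared difference is scaled by a non-negative weight.
    """
    diffs = case_base - query                        # shape: (n_cases, n_genes)
    dists = np.sqrt((weights * diffs ** 2).sum(axis=1))
    return np.argsort(dists, kind="stable")[:k]


def predict_label(labels, neighbor_idx):
    """Majority vote over the labels of the retrieved cases."""
    return Counter(labels[i] for i in neighbor_idx).most_common(1)[0][0]


# Toy case base: 4 previously treated patients, 3 genes each (made-up values).
case_base = np.array([
    [1.0, 5.0, 0.2],
    [1.1, 0.0, 0.1],
    [4.0, 5.1, 0.3],
    [4.2, 0.2, 0.2],
])
labels = ["A", "A", "B", "B"]

# Hypothetical feature weights: the second gene is treated as uninformative.
weights = np.array([1.0, 0.0, 0.5])

query = np.array([1.2, 5.0, 0.2])                    # new patient's profile
neighbors = retrieve_similar_cases(case_base, query, weights, k=2)
predicted = predict_label(labels, neighbors)
print(neighbors, predicted)
```

Setting a gene's weight to zero removes it from the comparison entirely, which is how feature weighting can double as feature selection on high-dimensional expression data.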