Unsupervised Feature Selection to Identify Important ICD-10 and ATC Codes for Machine Learning on a Cohort of Patients With Coronary Heart Disease: Retrospective Study

Abstract:

Background: The application of machine learning in health care often necessitates the use of hierarchical codes such as the International Classification of Diseases (ICD) and Anatomical Therapeutic Chemical (ATC) systems. These codes classify diseases and medications, respectively, thereby forming extensive data dimensions. Unsupervised feature selection tackles the "curse of dimensionality" and helps to improve the accuracy and performance of supervised learning models by reducing the number of irrelevant or redundant features and avoiding overfitting. Techniques for unsupervised feature selection, such as filter, wrapper, and embedded methods, are implemented to select the most important features with the most intrinsic information. However, these techniques face challenges due to the sheer volume of ICD and ATC codes and the hierarchical structures of these systems.
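To make the hierarchical structure concrete, the following minimal sketch expands a code into its chain of ancestors and computes a mean code level for a set of codes. It rests on stated assumptions: ICD-10 levels are approximated by character prefixes starting at the 3-character category (the chapter and block levels above it are ignored), and ATC levels follow the standard 1-, 3-, 4-, 5-, and 7-character groupings. The function names `icd_prefixes`, `atc_prefixes`, and `mean_code_level` are hypothetical, and the study's own tree definition may differ.

```python
def icd_prefixes(code):
    """Return an ICD-10 code and its ancestors, most general first.

    Assumption: hierarchy is approximated by character prefixes,
    starting at the 3-character category (e.g. I25).
    """
    flat = code.replace(".", "").upper()
    if len(flat) < 3:
        raise ValueError("expected at least a 3-character category: %r" % code)
    return [flat[:n] for n in range(3, len(flat) + 1)]


def atc_prefixes(code):
    """Return an ATC code and its ancestors using the 1/3/4/5/7-character levels."""
    flat = code.upper()
    return [flat[:n] for n in (1, 3, 4, 5, 7) if n <= len(flat)]


def mean_code_level(codes, prefixes=icd_prefixes):
    """Average depth of the given codes; lower values mean more general codes."""
    codes = list(codes)
    if not codes:
        raise ValueError("no codes given")
    return sum(len(prefixes(c)) for c in codes) / len(codes)


# Illustrative code strings only; they are not taken from the study data.
print(icd_prefixes("I25.10"))
print(atc_prefixes("C10AA05"))
print(mean_code_level(["I25", "I25.1"]))
```

Under this approximation, a feature set with a lower mean level consists of more general codes, which is one simple way to compare the complexity of selected features.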
Objective: The objective of this study was to compare several unsupervised feature selection methods for ICD and ATC code databases of patients with coronary artery disease across different aspects of performance and complexity and to select the set of features that best represents these patients.
Methods: We compared several unsupervised feature selection methods for 2 ICD and 1 ATC code databases of 51,506 patients with coronary artery disease in Alberta, Canada. Specifically, we used the Laplacian score, unsupervised feature selection for multicluster data, autoencoder-inspired unsupervised feature selection, principal feature analysis, and concrete autoencoders with and without ICD or ATC tree weight adjustment to select the 100 best features from over 9000 ICD and 2000 ATC codes. We assessed the selected features based on their ability to reconstruct the initial feature space and predict 90-day mortality following discharge. We also compared the complexity of the selected features by mean code level in the ICD or ATC tree and the interpretability of the features in the mortality prediction task using Shapley analysis.
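As an illustration of the simplest of the compared methods, the sketch below computes a Laplacian score for each feature column: features that vary little between neighbouring samples, relative to their overall variance, receive lower (better) scores. This is a plain standard-library sketch under assumptions not stated in the abstract: a symmetrised k-nearest-neighbour graph with heat-kernel weights, and the parameter names `n_neighbors` and `t`. It is not the study's implementation; on thousands of codes a vectorised or sparse implementation would be needed.

```python
import math


def laplacian_scores(X, n_neighbors=5, t=1.0):
    """Laplacian score per column of X (a list of equal-length rows).

    Lower scores indicate features that better preserve local structure.
    Constant features get math.inf.
    """
    n, m = len(X), len(X[0])
    if not 0 < n_neighbors < n:
        raise ValueError("n_neighbors must be between 1 and n_samples - 1")

    # Squared Euclidean distance between every pair of samples.
    d2 = [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
          for i in range(n)]

    # Heat-kernel weights on a symmetrised k-nearest-neighbour graph.
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: d2[i][j])
        for j in others[:n_neighbors]:
            S[i][j] = S[j][i] = math.exp(-d2[i][j] / t)

    degree = [sum(row) for row in S]
    total = sum(degree)

    scores = []
    for r in range(m):
        f = [row[r] for row in X]
        # Remove the degree-weighted mean so the trivial constant vector is excluded.
        mean = sum(d * v for d, v in zip(degree, f)) / total
        fc = [v - mean for v in f]
        # f^T L f = 1/2 * sum_ij S_ij (f_i - f_j)^2 and f^T D f = sum_i d_i f_i^2
        num = 0.5 * sum(S[i][j] * (fc[i] - fc[j]) ** 2
                        for i in range(n) for j in range(n))
        den = sum(d * v * v for d, v in zip(degree, fc))
        scores.append(num / den if den > 0 else math.inf)
    return scores


def select_top_k(X, k, **kwargs):
    """Indices of the k features with the lowest Laplacian scores."""
    scores = laplacian_scores(X, **kwargs)
    return sorted(range(len(scores)), key=lambda r: scores[r])[:k]
```

Calling `select_top_k(X, 100)` on a patient-by-code matrix would mirror the "100 best features" setting described above, with the caveat that graph construction choices strongly affect the ranking.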
Results: In feature space reconstruction and mortality prediction, the concrete autoencoder-based methods outperformed other techniques. Particularly, a weight-adjusted concrete autoencoder variant demonstrated improved reconstruction accuracy and significant predictive performance enhancement, confirmed by DeLong and McNemar tests (P<.05). Concrete autoencoders preferred more general codes, and they consistently reconstructed all features accurately. Additionally, features selected by weight-adjusted concrete autoencoders yielded higher Shapley values in mortality prediction than most alternatives.
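For readers unfamiliar with the McNemar test mentioned above, the following sketch compares two classifiers evaluated on the same patients by looking only at the cases where exactly one of them is correct. The abstract does not say which variant of the test was used; this sketch assumes the exact binomial (two-sided) version, and the function name `mcnemar_exact` is hypothetical. The DeLong test for comparing areas under the curve is not shown.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts cases only model A got right
    and c counts cases only model B got right.
    """
    b = c = 0
    for y, a, other in zip(y_true, pred_a, pred_b):
        if a == y and other != y:
            b += 1
        elif a != y and other == y:
            c += 1

    n = b + c
    if n == 0:
        # The models never disagree in correctness, so there is no evidence of a difference.
        return b, c, 1.0

    # Under H0 the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy labels for illustration only; they are not study data.
y_true = [1] * 10
model_a = [1] * 10
model_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(mcnemar_exact(y_true, model_a, model_b))
```

A P value below .05 from such a test indicates that the two models' error patterns differ more than chance alone would explain on the same set of patients.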
Conclusions: This study scrutinized 5 feature selection methods in ICD and ATC code data sets in an unsupervised context. Our findings underscore the superiority of the concrete autoencoder method in selecting salient features that represent the entire data set, offering a potential asset for subsequent machine learning research. We also present a novel weight adjustment approach for the concrete autoencoders specifically tailored for ICD and ATC code data sets to enhance the generalizability and interpretability of the selected features.
