Keywords: curse of dimensionality; dimensionality reduction; feature selection; information theory; principle of parsimony; statistical learning; variable selection

Source: DOI: 10.1016/j.patter.2022.100471

Abstract:
We present a new heuristic feature-selection (FS) algorithm that integrates in a principled algorithmic framework the three key FS components: relevance, redundancy, and complementarity. Thus, we call it relevance, redundancy, and complementarity trade-off (RRCT). The association strength between each feature and the response and between feature pairs is quantified via an information theoretic transformation of rank correlation coefficients, and the feature complementarity is quantified using partial correlation coefficients. We empirically benchmark the performance of RRCT against 19 FS algorithms across four synthetic and eight real-world datasets in indicative challenging settings evaluating the following: (1) matching the true feature set and (2) out-of-sample performance in binary and multi-class classification problems when presenting selected features into a random forest. RRCT is very competitive in both tasks, and we tentatively make suggestions on the generalizability and application of the best-performing FS algorithms across settings where they may operate effectively.
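To make the quantities named in the abstract concrete, the sketch below pairs an information-theoretic transform of the Spearman rank correlation, I = -0.5·ln(1 - ρ²), with a first-order partial correlation, inside a toy greedy forward-selection loop. This is a minimal illustration under assumed equal trade-off weights, not the authors' RRCT implementation; the function names and the selection loop are assumptions introduced here for illustration.

```python
# Illustrative sketch only (not the authors' reference code): relevance/redundancy via an
# information-theoretic transform of rank correlation, complementarity via partial correlation.
import numpy as np
from scipy.stats import spearmanr


def rank_corr_mi(x, y):
    """Map Spearman rank correlation to a mutual-information-like score
    via the Gaussian transform I = -0.5 * ln(1 - rho^2)."""
    rho, _ = spearmanr(x, y)
    rho = np.clip(rho, -0.999999, 0.999999)  # guard against |rho| = 1
    return -0.5 * np.log(1.0 - rho ** 2)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y controlling for z,
    computed from pairwise Spearman correlations."""
    r_xy, _ = spearmanr(x, y)
    r_xz, _ = spearmanr(x, z)
    r_yz, _ = spearmanr(y, z)
    denom = np.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denom if denom > 0 else 0.0


def greedy_select(X, y, k=5):
    """Toy greedy forward selection balancing relevance, redundancy, and
    complementarity with equal weights (an assumption, not the RRCT weighting)."""
    n_features = X.shape[1]
    selected, remaining = [], list(range(n_features))
    relevance = np.array([rank_corr_mi(X[:, j], y) for j in range(n_features)])
    while remaining and len(selected) < k:
        scores = []
        for j in remaining:
            redundancy = np.mean([rank_corr_mi(X[:, j], X[:, s]) for s in selected]) if selected else 0.0
            complementarity = np.mean([abs(partial_corr(X[:, j], y, X[:, s])) for s in selected]) if selected else 0.0
            scores.append(relevance[j] - redundancy + complementarity)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Synthetic example: features 0 and 3 drive a binary response.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = (X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=200) > 0).astype(int)
    print(greedy_select(X, y, k=3))
```

The selected feature indices could then be fed into a downstream classifier such as a random forest, mirroring the out-of-sample evaluation described in the abstract.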