Keywords: Chi-square statistic; Pearson correlation coefficient; feature screening; mutual information; power-law distribution; weighted mean squared deviation

Source: DOI:10.3390/e22030335

Abstract:
In this study, we propose a novel model-free feature screening method for ultrahigh-dimensional binary features in binary classification, called weighted mean squared deviation (WMSD). Compared with the Chi-square statistic and mutual information, WMSD gives more opportunity to binary features with probabilities near 0.5. In addition, the asymptotic properties of the proposed method are investigated theoretically under the assumption log p = o(n). In practice, the number of features is selected by a Pearson correlation coefficient method based on the property of the power-law distribution. Lastly, an empirical study of Chinese text classification illustrates that the proposed method performs well when the dimension of the selected features is relatively small.
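For orientation, the two baselines named in the abstract can be sketched as marginal screening scores for binary features against a binary label. This is only an illustrative sketch using the textbook definitions of the Pearson Chi-square statistic and mutual information; it is not the WMSD statistic, whose formula the abstract does not state. The function names `chi_square_scores` and `mutual_information_scores` and the toy data are assumptions made for this example.

```python
import numpy as np


def chi_square_scores(X, y):
    """Pearson chi-square statistic of each binary column of X against binary y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    # Observed counts of the 2x2 contingency table, one table per feature.
    n11 = X.T @ y              # feature = 1, label = 1
    n10 = X.T @ (1 - y)        # feature = 1, label = 0
    n01 = (1 - X).T @ y        # feature = 0, label = 1
    n00 = (1 - X).T @ (1 - y)  # feature = 0, label = 0
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    # Constant features (or a constant label) give den == 0; score them as 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def mutual_information_scores(X, y):
    """Empirical mutual information (in nats) of each binary column of X with y."""
    X = np.asarray(X)
    y = np.asarray(y)
    scores = np.zeros(X.shape[1])
    for a in (0, 1):
        for b in (0, 1):
            joint = ((X == a) & (y[:, None] == b)).mean(axis=0)
            px = (X == a).mean(axis=0)
            py = (y == b).mean()
            mask = joint > 0  # terms with zero joint probability contribute 0
            scores[mask] += joint[mask] * np.log(joint[mask] / (px[mask] * py))
    return scores


# Toy data (hypothetical): feature 0 equals the label, feature 1 is
# independent of the label, feature 2 is constant.
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_demo = np.column_stack([
    y_demo,
    np.array([0, 1, 0, 1, 0, 1, 0, 1]),
    np.zeros(8, dtype=int),
])
chi2_demo = chi_square_scores(X_demo, y_demo)
mi_demo = mutual_information_scores(X_demo, y_demo)
# Screening keeps the top-k features by score; here k = 1 keeps feature 0.
top_by_chi2 = np.argsort(-chi2_demo)[:1]
```

On the toy data, the feature identical to the label gets the largest score under both criteria (8 for Chi-square, log 2 for mutual information), while the independent and constant features score 0. Any screening rule of this kind ranks features by such a marginal score and keeps the top ones.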