Generalized Fisher Score for Feature Selection

Fisher score is one of the most widely used supervised feature selection methods. However, it selects each feature independently according to its score under the Fisher criterion, which leads to a suboptimal subset of features. In this paper, we present a generalized Fisher score to jointly select features. It aims at finding a subset of features that maximizes the lower bound of the traditional Fisher score. The resulting feature selection problem is a mixed integer programming problem, which can be reformulated as a quadratically constrained linear program (QCLP). It is solved by a cutting plane algorithm, in each iteration of which a multiple kernel learning problem is solved alternately by multivariate ridge regression and projected gradient descent. Experiments on benchmark data sets indicate that the proposed method outperforms Fisher score as well as many other state-of-the-art feature selection methods.
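For context, the traditional Fisher criterion scores each feature in isolation. A standard formulation (generic notation, not necessarily the paper's exact one) for the j-th feature is

$$
F(x^j) = \frac{\sum_{k=1}^{c} n_k \,(\mu_k^j - \mu^j)^2}{\sum_{k=1}^{c} n_k \,(\sigma_k^j)^2},
$$

where $c$ is the number of classes, $n_k$ the number of samples in class $k$, $\mu_k^j$ and $\sigma_k^j$ the mean and standard deviation of feature $j$ within class $k$, and $\mu^j$ the overall mean of feature $j$. Features are then ranked by this score and the top $m$ are kept, which is exactly the per-feature independence that the generalized Fisher score is designed to avoid.

The sketch below illustrates only this baseline per-feature scoring and top-m selection, not the paper's joint QCLP/cutting-plane method; the function names and the numerical safeguard are illustrative assumptions.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher scores: between-class scatter over within-class scatter.

    X : (n_samples, n_features) array, y : (n_samples,) integer class labels.
    This is the classical per-feature criterion; the paper's generalized
    score instead selects features jointly.
    """
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    numerator = np.zeros(X.shape[1])
    denominator = np.zeros(X.shape[1])
    for k in classes:
        Xk = X[y == k]
        nk = Xk.shape[0]
        numerator += nk * (Xk.mean(axis=0) - overall_mean) ** 2
        denominator += nk * Xk.var(axis=0)
    # Small constant guards against zero within-class variance (an assumption,
    # not part of the original formulation).
    return numerator / np.maximum(denominator, 1e-12)

def select_top_m(X, y, m):
    """Baseline behaviour: rank features independently and keep the m best."""
    scores = fisher_scores(X, y)
    return np.argsort(scores)[::-1][:m]
```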
