Direct Zero-Norm Optimization for Feature Selection

The zero-norm of a vector, defined as its number of non-zero elements, is an ideal quantity for feature selection. However, zero-norm minimization is generally regarded as a combinatorially hard (indeed NP-hard) optimization problem. In contrast to previous methods, which typically optimize a surrogate of the zero-norm, in this paper we propose a method that optimizes the zero-norm directly for feature selection. Based on Expectation Maximization (EM), the method reduces to solving a sequence of Quadratic Programming (QP) problems and can therefore be optimized in polynomial time in practice. We show that the proposed optimization technique has a natural Bayesian interpretation and converges to the true zero-norm asymptotically, provided that a good starting point is given. Following the same scheme, we further show that an arbitrary-norm Support Vector Machine can be trained in polynomial time. A series of experiments demonstrates that the proposed EM-based zero-norm method outperforms other state-of-the-art feature-selection methods on biological microarray data and UCI data sets, in terms of both accuracy and learning efficiency.
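The abstract only outlines the method, but the mechanism it describes, an EM loop whose maximization step is a quadratic problem, belongs to the family of iteratively reweighted quadratic schemes for approximating the zero-norm. Below is a minimal, hypothetical sketch of that family in Python, assuming a plain least-squares loss with a closed-form weighted ridge update rather than the paper's SVM-style QP; the function name reweighted_zero_norm and the parameters lam and eps are illustrative choices, not taken from the paper.

```python
import numpy as np

def reweighted_zero_norm(X, y, n_iter=50, lam=1.0, eps=1e-8):
    """Iteratively reweighted quadratic scheme approximating zero-norm
    minimization of a linear model w (illustrative sketch; the paper's
    M-step is an SVM-style QP, not this ridge update).

    Each iteration solves
        min_w ||X w - y||^2 + lam * sum_j w_j^2 / (w_prev_j^2 + eps),
    where the penalty term approaches the number of non-zero
    coefficients as w stabilizes, driving small weights toward zero.
    """
    n, d = X.shape
    w = np.linalg.lstsq(X, y, rcond=None)[0]  # a reasonable starting point
    for _ in range(n_iter):
        # E-step analogue: per-feature penalty weights from the current iterate
        D = lam / (w ** 2 + eps)
        # M-step analogue: a quadratic problem with a closed-form solution
        w = np.linalg.solve(X.T @ X + np.diag(D), X.T @ y)
    w[np.abs(w) < np.sqrt(eps)] = 0.0  # prune numerically zero coefficients
    return w

# Example: recover a 3-sparse signal among 50 candidate features.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
w_true = np.zeros(50)
w_true[[3, 17, 41]] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.01 * rng.standard_normal(100)
w_hat = reweighted_zero_norm(X, y)
print(np.flatnonzero(w_hat))  # typically [3, 17, 41]
```

The closed-form solve stands in for a QP solver only to keep the sketch self-contained; the fixed-point behavior, in which the penalty weight of a vanishing coefficient grows without bound, is what produces exact zeros after pruning and mirrors why a sequence of quadratic problems can approach the true zero-norm from a good starting point.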
