Sparse multinomial logistic regression: fast algorithms and generalization bounds

Recently developed methods for learning sparse classifiers are among the state-of-the-art in supervised learning. These methods learn classifiers that incorporate weighted sums of basis functions with sparsity-promoting priors encouraging the weight estimates to be either significantly large or exactly zero. From a learning-theoretic perspective, these methods control the capacity of the learned classifier by minimizing the number of basis functions used, resulting in better generalization. This paper presents three contributions related to learning sparse classifiers. First, we introduce a true multiclass formulation based on multinomial logistic regression. Second, by combining a bound optimization approach with a component-wise update procedure, we derive fast exact algorithms for learning sparse multiclass classifiers that scale favorably in both the number of training samples and the feature dimensionality, making them applicable even to large data sets in high-dimensional feature spaces. To the best of our knowledge, these are the first algorithms to perform exact multinomial logistic regression with a sparsity-promoting prior. Third, we show how nontrivial generalization bounds can be derived for our classifier in the binary case. Experimental results on standard benchmark data sets attest to the accuracy, sparsity, and efficiency of the proposed methods.
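To make the formulation concrete, the following is a minimal sketch of the kind of model and MAP criterion the abstract describes, a multinomial logistic (softmax) model with a sparsity-promoting prior on the weights. The notation (class weight vectors w^(i), number of classes m, training size N, regularization parameter λ) is introduced here purely for illustration and is not taken verbatim from the paper.

```latex
% Multinomial logistic model: one weight vector \mathbf{w}^{(i)} per class i = 1,\dots,m.
\[
  P\big(y = i \mid \mathbf{x}, \mathbf{w}\big) \;=\;
  \frac{\exp\!\big(\mathbf{w}^{(i)\top}\mathbf{x}\big)}
       {\sum_{j=1}^{m} \exp\!\big(\mathbf{w}^{(j)\top}\mathbf{x}\big)} .
\]
% A sparsity-promoting (Laplacian) prior on \mathbf{w} yields a MAP estimate with an l1 penalty:
\[
  \widehat{\mathbf{w}} \;=\; \arg\max_{\mathbf{w}}
  \;\sum_{n=1}^{N} \log P\big(y_n \mid \mathbf{x}_n, \mathbf{w}\big)
  \;-\; \lambda \,\|\mathbf{w}\|_1 ,
\]
% where the l1 term drives many components of \mathbf{w} exactly to zero.
```

The bound-optimization strategy mentioned in the abstract replaces the log-likelihood at each iteration with a quadratic surrogate that lower-bounds it, so that the penalized surrogate can be maximized exactly, one weight component at a time, with a closed-form (soft-thresholding-style) update.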
