Feature selection, L1 vs. L2 regularization, and rotational invariance

We consider supervised learning in the presence of many irrelevant features, and study two different regularization methods for preventing overfitting. Focusing on logistic regression, we show that with L1 regularization of the parameters, the sample complexity (i.e., the number of training examples required to learn "well") grows only logarithmically in the number of irrelevant features. This logarithmic rate matches the best known bounds for feature selection, and indicates that L1-regularized logistic regression can be effective even when there are exponentially more irrelevant features than training examples. We also give a lower bound showing that any rotationally invariant algorithm---including logistic regression with L2 regularization, SVMs, and neural networks trained by backpropagation---has a worst-case sample complexity that grows at least linearly in the number of irrelevant features.
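A minimal sketch of the comparison the abstract describes, assuming a synthetic data generator and scikit-learn's LogisticRegression; the feature counts, regularization strength, and solver choices below are illustrative assumptions, not the paper's experimental setup:

```python
# Hedged illustration (not the paper's method): L1- vs. L2-regularized logistic
# regression on data where only a handful of features are relevant.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_samples, n_relevant, n_irrelevant = 200, 5, 1000
X = rng.normal(size=(n_samples, n_relevant + n_irrelevant))
true_w = np.zeros(n_relevant + n_irrelevant)
true_w[:n_relevant] = 1.0                      # only the first few features matter
y = (X @ true_w + 0.1 * rng.normal(size=n_samples) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for penalty, solver in [("l1", "liblinear"), ("l2", "lbfgs")]:
    clf = LogisticRegression(penalty=penalty, solver=solver, C=1.0, max_iter=5000)
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    n_nonzero = np.count_nonzero(clf.coef_)
    print(f"{penalty}: test accuracy={acc:.2f}, nonzero coefficients={n_nonzero}")
```

In this kind of setup one would expect the L1 penalty to zero out most of the irrelevant coefficients and generalize better from few samples, while the L2 penalty spreads weight across all features; the exact numbers depend on the assumed data generator and regularization strength.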
