Sparse Algorithms Are Not Stable: A No-Free-Lunch Theorem

We consider two desirable properties of learning algorithms: sparsity and algorithmic stability. Both properties are believed to lead to good generalization ability. We show that these two properties are fundamentally at odds with each other: a sparse algorithm cannot be stable, and vice versa. Thus, one has to trade off sparsity and stability when designing a learning algorithm. In particular, our general result implies that ℓ1-regularized regression (Lasso) cannot be stable, while ℓ2-regularized regression is known to have strong stability properties and is therefore not sparse.
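
For concreteness, the following sketch records one standard way the two properties are formalized; the symbols λ and β_n and the uniform-stability definition (in the sense of Bousquet and Elisseeff) are supplied here for illustration and are not part of the abstract itself.

\[
\hat{w}_{\mathrm{lasso}} \in \arg\min_{w}\; \|y - Xw\|_2^2 + \lambda \|w\|_1,
\qquad
\hat{w}_{\mathrm{ridge}} = \arg\min_{w}\; \|y - Xw\|_2^2 + \lambda \|w\|_2^2 .
\]

\[
\text{Uniform stability:}\quad
\bigl|\,\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)\,\bigr| \le \beta_n
\quad \text{for all samples } S \text{ of size } n,\ \text{all } i,\ \text{all } z,
\]

where "stable" typically means β_n = O(1/n). The ℓ1 penalty drives coordinates of the solution exactly to zero (sparsity), whereas the ℓ2 penalty only shrinks them; regularized least squares with the ℓ2 penalty is known to be uniformly stable with β_n roughly of order 1/(λn), while the theorem above says that no algorithm returning genuinely sparse solutions, Lasso included, can admit such a bound.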
