(Exponentiated) Stochastic Gradient Descent for L1 Constrained Problems

This note is by Sham Kakade, Dean Foster, and Eyal Even-Dar. It is intended as an introductory piece on solving L1 constrained problems with online methods. Convex optimization problems with L1 constraints frequently underlie tasks such as feature selection and obtaining sparse representations. This note shows that the exponentiated gradient algorithm of Kivinen and Warmuth (1997), when used as a stochastic gradient descent algorithm, is quite effective as an optimization tool for general convex loss functions: under mild assumptions, it requires a number of gradient steps that is only logarithmic in the number of dimensions. In particular, for supervised learning problems in which we wish to approximately minimize a general convex loss (including the square, logistic, hinge, or absolute loss) in the presence of many irrelevant features, the algorithm is efficient, with a sample complexity that is only logarithmic in the total number of features and a computational complexity that is linear in the total number of features (ignoring log factors).
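
To make the update concrete, below is a minimal sketch of an EG±-style multiplicative update used as stochastic gradient descent on an L1-constrained least-squares problem, in the spirit of Kivinen and Warmuth (1997) [7]. The function name eg_pm_sgd, the squared loss, the step size, and the synthetic data are illustrative assumptions, not the note's exact algorithm or its guarantees.

```python
# A minimal sketch, assuming the standard EG+/- multiplicative update of
# Kivinen and Warmuth (1997) applied as stochastic gradient descent on an
# L1-constrained least-squares problem.  Names, the step size, and the
# synthetic data are illustrative assumptions only.
import numpy as np


def eg_pm_sgd(X, y, l1_radius=1.0, eta=0.05, n_steps=10_000, seed=0):
    """EG+/- stochastic updates; returns w with ||w||_1 <= l1_radius."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Probability vector over 2d coordinates: a positive and a negative copy
    # of each feature, so that w = l1_radius * (theta_plus - theta_minus).
    theta = np.full(2 * d, 1.0 / (2 * d))
    for _ in range(n_steps):
        i = rng.integers(n)                       # sample one training example
        w = l1_radius * (theta[:d] - theta[d:])   # current iterate, ||w||_1 <= l1_radius
        g = (X[i] @ w - y[i]) * X[i]              # gradient of 0.5 * (x.w - y)^2 in w
        # Multiplicative update, then renormalize so the 2d weights sum to one.
        theta[:d] *= np.exp(-eta * l1_radius * g)
        theta[d:] *= np.exp(+eta * l1_radius * g)
        theta /= theta.sum()
    return l1_radius * (theta[:d] - theta[d:])


if __name__ == "__main__":
    # Toy problem: 3 relevant features among 1000, as in the sparse setting above.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 1000))
    w_true = np.zeros(1000)
    w_true[:3] = [0.4, -0.3, 0.2]                 # ||w_true||_1 = 0.9 <= l1_radius
    y = X @ w_true + 0.01 * rng.standard_normal(500)
    w_hat = eg_pm_sgd(X, y, l1_radius=1.0)
    print("largest recovered weights (by index):", np.argsort(-np.abs(w_hat))[:5])
```

For the stochastic-optimization guarantees cited above, one would typically average the iterates rather than keep only the last one (an online-to-batch conversion in the spirit of [5]); the sketch returns the last iterate for brevity.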

[1] Michael Elad et al. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences of the United States of America, 2003.

[2] Martin Zinkevich et al. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. ICML, 2003.

[3] Dean P. Foster et al. The risk inflation criterion for multiple regression, 1994.

[4] Stéphane Mallat et al. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 1993.

[5] Claudio Gentile et al. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 2001.

[6] Michael A. Saunders et al. Atomic Decomposition by Basis Pursuit. SIAM Journal on Scientific Computing, 1998.

[7] Manfred K. Warmuth et al. Exponentiated Gradient Versus Gradient Descent for Linear Predictors. Information and Computation, 1997.

[8] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[9] I. Johnstone et al. Ideal spatial adaptation by wavelet shrinkage, 1994.

[10] Terence Tao et al. The Dantzig selector: Statistical estimation when p is much larger than n. arXiv:math/0506081, 2005.

[11] Adam Tauman Kalai et al. Online convex optimization in the bandit setting: gradient descent without a gradient. SODA '05, 2004.

[12] W. Welch. Algorithmic complexity: three NP-hard problems in computational statistics, 1982.

[13] Alan J. Miller. Subset Selection in Regression, 1992.

[14] A. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. ICML, 2004.