Sparse Accelerated Exponential Weights

We consider the stochastic optimization problem of minimizing a convex function by recursively observing its gradients. We introduce SAEW, a new procedure that accelerates exponential weights procedures from the slow rate $1/\sqrt{T}$ to the fast rate $1/T$. Under strong convexity of the risk, we achieve the optimal rate of convergence for approximating sparse parameters in $\mathbb{R}^d$. The acceleration is obtained by successive averaging steps performed in an online fashion. The procedure also produces sparse estimators thanks to additional hard-thresholding steps.
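The abstract does not spell out the algorithm, but the combination it describes (an exponential weights learner, online iterate averaging, and hard thresholding) can be illustrated concretely. The following is a minimal sketch, not the paper's exact procedure: it assumes an EG$\pm$-style exponential weights learner over the $2d$ signed coordinate directions, Polyak-style averaging within each stage, and a restart around the hard-thresholded average. The function `saew_sketch`, the stage schedule, the step size `eta`, and the threshold `tau` are all hypothetical choices for illustration.

```python
import numpy as np

def saew_sketch(grad_oracle, d, T, stages=4, radius=1.0, eta=0.1, tau=1e-2):
    """Illustrative sketch (not the paper's exact algorithm): run an
    exponential weights (EG+/-) learner, average its iterates within each
    stage, then hard-threshold the average and restart around it.

    grad_oracle(w) should return a stochastic gradient of the risk at w.
    """
    # EG+/- keeps a probability vector over the 2d directions (+e_i, -e_i);
    # the predictor is center + radius * (p_plus - p_minus).
    logits = np.zeros(2 * d)
    center = np.zeros(d)
    steps_per_stage = max(T // stages, 1)
    for _ in range(stages):
        avg = np.zeros(d)
        for t in range(steps_per_stage):
            p = np.exp(logits - logits.max())
            p /= p.sum()
            w = center + radius * (p[:d] - p[d:])
            g = grad_oracle(w)
            # Exponential weights update with linearized losses +/- g.
            logits -= eta * np.concatenate([g, -g])
            avg += (w - avg) / (t + 1)   # online averaging (the acceleration step)
        center = np.where(np.abs(avg) > tau, avg, 0.0)  # hard threshold (sparsity)
        logits[:] = 0.0                  # restart the learner around the new center
    return center

# Toy usage: sparse least squares with Gaussian design.
rng = np.random.default_rng(0)
d = 50
w_star = np.zeros(d)
w_star[:3] = [1.0, -0.5, 0.25]

def grad_oracle(w):
    x = rng.standard_normal(d)
    y = x @ w_star + 0.1 * rng.standard_normal()
    return (x @ w - y) * x

w_hat = saew_sketch(grad_oracle, d, T=20_000)
```

The restart-around-a-thresholded-average design mirrors the abstract's two ingredients: averaging within a stage smooths the $1/\sqrt{T}$-rate iterates, while thresholding between stages keeps the running estimate sparse.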
