Efficient Stochastic Gradient Descent for Strongly Convex Optimization

We motivate this study from a recent work on a stochastic gradient descent (SGD) method that requires only one projection \citep{DBLP:conf/nips/MahdaviYJZY12}, which alleviates the computational bottleneck of standard SGD, namely the projection performed at every iteration, and enjoys an $O(\log T/T)$ convergence rate for strongly convex optimization. In this paper, we make further contributions along this line. First, we develop an epoch-projection SGD method that makes at most $\log_2 T$ projections yet achieves the optimal $O(1/T)$ convergence rate for {\it strongly convex optimization}. Second, we present a proximal extension that exploits the structure of the objective function to further speed up computation and convergence for sparse regularized loss minimization problems. Finally, we apply the proposed techniques to the high-dimensional large margin nearest neighbor classification problem, yielding a speed-up of orders of magnitude.
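
For intuition, here is a minimal Python sketch of the epoch-projection idea (the function names, the doubling/halving schedule parameters, and the unit-ball toy problem are illustrative assumptions, not the authors' implementation): within each epoch the iterate takes plain, unprojected stochastic gradient steps, and the projection onto the feasible set is applied only once, to the epoch average, so roughly $\log_2 T$ projections cover $T$ stochastic gradient evaluations.

```python
import numpy as np

def epoch_projection_sgd(stoch_grad, project, x0, num_epochs=10, t1=8, eta1=0.5):
    """Sketch of epoch-projection SGD for a strongly convex objective.

    Inside an epoch, iterates take plain (unprojected) stochastic
    gradient steps; the Euclidean projection onto the feasible set is
    applied only once per epoch, to the epoch average.  Epoch lengths
    double and step sizes halve, so roughly log2(T) projections suffice
    for T total stochastic gradient evaluations.
    """
    x = project(np.asarray(x0, dtype=float))
    t_k, eta_k = t1, eta1
    for _ in range(num_epochs):
        avg = np.zeros_like(x)
        for _ in range(t_k):
            x = x - eta_k * stoch_grad(x)  # no projection inside the epoch
            avg += x
        x = project(avg / t_k)             # the single projection of this epoch
        t_k *= 2                           # double the epoch length ...
        eta_k /= 2                         # ... and halve the step size
    return x

# Toy usage: minimize E[0.5 * ||x - (z + noise)||^2] over the unit ball.
rng = np.random.default_rng(0)
z = np.array([0.6, -0.8]) * 2.0            # optimum lies outside the ball
stoch_grad = lambda x: (x - z) + 0.1 * rng.standard_normal(x.shape)
project = lambda x: x / max(1.0, np.linalg.norm(x))
x_hat = epoch_projection_sgd(stoch_grad, project, np.zeros(2))
```

For the sparse regularized problems mentioned above, the `project` call would be replaced by a proximal step (e.g., soft-thresholding for $\ell_1$ regularization) while keeping the same epoch structure; that variant is omitted here.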

[1] Y. Nesterov. Gradient methods for minimizing composite objective function, 2007.

[2] Elad Hazan, et al. Sparse Approximate Solutions to Semidefinite Programs, 2008, LATIN.

[3] Gábor Lugosi, et al. Concentration Inequalities, 2008, COLT.

[4] Eric Moulines, et al. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning, 2011, NIPS.

[5] Wei Liu, et al. Constrained Metric Learning via Distance Gap Maximization, 2010, AAAI.

[6] Alexander Shapiro, et al. Stochastic Approximation Approach to Stochastic Programming, 2013.

[7] Philip Wolfe, et al. An algorithm for quadratic programming, 1956.

[8] Ohad Shamir, et al. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization, 2011, ICML.

[9] Léon Bottou, et al. Large-Scale Machine Learning with Stochastic Gradient Descent, 2010, COMPSTAT.

[10] Yoram Singer, et al. Efficient Learning using Forward-Backward Splitting, 2009, NIPS.

[11] Tat-Seng Chua, et al. An efficient sparse metric learning in high-dimensional space via $\ell_1$-penalized log-determinant regularization, 2009, ICML.

[12] Kenneth L. Clarkson, et al. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm, 2008, SODA.

[13] Kilian Q. Weinberger, et al. Distance Metric Learning for Large Margin Nearest Neighbor Classification, 2005, NIPS.

[14] Elad Hazan, et al. An optimal algorithm for stochastic strongly-convex optimization, 2010, arXiv:1006.2425.

[15] Elad Hazan, et al. A Linearly Convergent Conditional Gradient Algorithm with Applications to Online and Stochastic Optimization, 2013, arXiv:1301.4666.

[16] Peng Li, et al. Distance Metric Learning with Eigenvalue Optimization, 2012, J. Mach. Learn. Res.

[17] Rong Jin, et al. O(log T) Projections for Stochastic Optimization of Smooth and Strongly Convex Functions, 2013, ICML.

[18] Martin Jaggi, et al. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization, 2013, ICML.

[19] Martin Jaggi, et al. Sparse Convex Optimization Methods for Machine Learning, 2011.

[20] Ambuj Tewari, et al. Composite objective mirror descent, 2010, COLT.

[21] Jinfeng Yi, et al. Stochastic Gradient Descent with Only One Projection, 2012, NIPS.

[22] Elad Hazan, et al. Projection-free Online Learning, 2012, ICML.