Less than a Single Pass: Stochastically Controlled Stochastic Gradient

We develop and analyze a procedure for gradient-based optimization that we refer to as stochastically controlled stochastic gradient (SCSG). As a member of the SVRG family of algorithms, SCSG makes use of gradient estimates at two scales, with the number of updates at the faster scale being governed by a geometric random variable. Unlike most existing algorithms in this family, neither the computation cost nor the communication cost of SCSG necessarily scales linearly with the sample size $n$; indeed, these costs are independent of $n$ when the target accuracy is low. An experimental evaluation on real datasets confirms the effectiveness of SCSG.
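The two-scale structure described above can be illustrated with a minimal sketch. It is not the authors' reference implementation: the abstract does not specify the batch size, the step size, how the inner indices are sampled, or how the geometric variable is parameterized. The sketch assumes a scalar least-squares objective, a fixed batch size, inner indices drawn from the current batch, and a geometric variable whose mean equals the batch size. The names `geometric` and `scsg_least_squares` are hypothetical.

```python
import math
import random


def geometric(mean, rng):
    """Draw N in {0, 1, 2, ...} with P(N = k) = q**k * (1 - q), where
    q = mean / (mean + 1), so that E[N] = mean (assumed parameterization)."""
    q = mean / (mean + 1.0)
    u = 1.0 - rng.random()  # uniform on (0, 1], keeps log(u) finite
    return int(math.log(u) / math.log(q))


def scsg_least_squares(a, b, batch_size, step, epochs, seed=0):
    """SCSG-style sketch for f(x) = (1/n) * sum_i 0.5 * (a[i] * x - b[i])**2.

    Returns the final iterate and the number of component-gradient
    evaluations performed.
    """
    rng = random.Random(seed)
    n = len(a)
    x = 0.0
    grad_evals = 0
    for _ in range(epochs):
        # Slow scale: gradient over a random batch, evaluated at a snapshot.
        batch = rng.sample(range(n), batch_size)
        snapshot = x
        g = sum(a[i] * (a[i] * snapshot - b[i]) for i in batch) / batch_size
        grad_evals += batch_size

        # Fast scale: a geometrically distributed number of SVRG-style steps.
        for _ in range(geometric(batch_size, rng)):
            i = rng.choice(batch)  # assumption: sample from the same batch
            grad_now = a[i] * (a[i] * x - b[i])
            grad_snap = a[i] * (a[i] * snapshot - b[i])
            x -= step * (grad_now - grad_snap + g)
            grad_evals += 2
    return x, grad_evals
```

In this sketch each outer iteration touches only `batch_size` samples for the slow-scale gradient, so when `batch_size` is much smaller than $n$ the per-iteration cost does not depend on $n$. Whether a given accuracy is reached in fewer than $n$ gradient evaluations depends on the problem and on the chosen batch and step sizes.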
