Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging

This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, it provides a sharp analysis of: (1) mini-batching, a method that averages many stochastic gradient samples both to reduce the variance of the gradient estimate and to parallelize SGD, and (2) tail-averaging, a method that averages the final few iterates of SGD to decrease the variance of SGD's final iterate. This work presents the first tight non-asymptotic generalization error bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, it establishes a precise, problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one. These results are used to provide a highly parallelizable SGD algorithm that attains the optimal statistical error rate with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD-style methods. Finally, this work sheds light on some fundamental differences in SGD's behavior under agnostic noise in the (non-realizable) least squares regression problem; in particular, it shows that the step sizes ensuring optimal statistical error rates in the agnostic case must depend on the noise properties. The central analysis tools are obtained by generalizing the operator view of averaged SGD introduced by Défossez and Bach [11], and by developing a novel analysis that bounds these operators to characterize the generalization error. These techniques may be of broader interest for analyzing various computational aspects of stochastic approximation.
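To make the two averaging schemes concrete, here is a minimal sketch of mini-batched, tail-averaged SGD for streaming least squares regression. The function name, step size, batch size, and tail fraction below are illustrative choices, not the problem-dependent values analyzed in the paper; the paper's results concern how such parameters should scale with the problem's covariance and noise.

```python
import numpy as np

def minibatch_tail_averaged_sgd(sample_batch, dim, n_steps, batch_size,
                                step_size, tail_fraction=0.5):
    """Streaming SGD for least squares with mini-batching and tail-averaging.

    `sample_batch(b)` must return a fresh mini-batch (X, y) with X of shape
    (b, dim) and y of shape (b,), drawn i.i.d. from the data distribution
    (the stochastic approximation setting: every sample is seen once).
    """
    w = np.zeros(dim)            # current iterate
    w_tail = np.zeros(dim)       # running average of the tail iterates
    tail_start = int((1.0 - tail_fraction) * n_steps)
    n_tail = 0

    for t in range(n_steps):
        X, y = sample_batch(batch_size)
        # Mini-batch stochastic gradient of the square loss 0.5 * E[(x'w - y)^2]:
        # averaging over the batch reduces the gradient's variance, and the
        # batch_size gradient evaluations can be computed in parallel.
        grad = X.T @ (X @ w - y) / batch_size
        w = w - step_size * grad
        if t >= tail_start:
            # Tail-averaging: average only the final iterates to damp the
            # variance of SGD's final iterate.
            n_tail += 1
            w_tail += (w - w_tail) / n_tail

    return w_tail


# Illustrative usage on synthetic data with additive noise.
rng = np.random.default_rng(0)
dim = 10
w_star = rng.standard_normal(dim)

def sample_batch(b):
    X = rng.standard_normal((b, dim))
    y = X @ w_star + 0.1 * rng.standard_normal(b)
    return X, y

w_hat = minibatch_tail_averaged_sgd(sample_batch, dim, n_steps=2000,
                                    batch_size=16, step_size=0.05)
print("parameter error:", np.linalg.norm(w_hat - w_star))
```

In this sketch, increasing `batch_size` lets more gradient computations run in parallel per serial update, while the tail average is taken only over the last half of the iterates; the paper quantifies how far the batch size can be increased before the near-linear speedup degrades, and how the step size must account for the noise in the non-realizable case.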

[1] Prateek Jain et al. Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja's Algorithm. COLT, 2016.

[2] Stephen J. Wright et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS, 2011.

[3] Joseph K. Bradley et al. Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML, 2011.

[4] Alexander J. Smola et al. Parallelized Stochastic Gradient Descent. NIPS, 2010.

[5] Martin J. Wainwright et al. Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization. IEEE Transactions on Information Theory, 2010.

[6] Francis Bach et al. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. NIPS, 2014.

[7] Yuchen Zhang et al. DiSCO: Distributed Optimization for Self-Concordant Empirical Loss. ICML, 2015.

[8] Ohad Shamir et al. Better Mini-Batch Algorithms via Accelerated Gradient Methods. NIPS, 2011.

[9] Jonathan D. Rosenblatt et al. On the Optimality of Averaging in Distributed Statistical Learning. arXiv:1407.2724, 2014.

[10] Dimitris S. Papailiopoulos et al. Perturbed Iterate Analysis for Asynchronous Stochastic Optimization. SIAM Journal on Optimization, 2015.

[11] Francis R. Bach et al. Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions. AISTATS, 2015.

[12] Ohad Shamir et al. Optimal Distributed Online Prediction Using Mini-Batches. Journal of Machine Learning Research, 2010.

[13] Tong Zhang et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. NIPS, 2013.

[14] Peter Richtárik et al. Distributed Mini-Batch SDCA. arXiv preprint, 2015.

[15] Boris Polyak. Acceleration of Stochastic Approximation by Averaging. 1992.

[16] F. Bach et al. Non-parametric Stochastic Approximation with Large Step Sizes. arXiv:1408.0361, 2014.

[17] Alexander J. Smola et al. Efficient Mini-Batch Training for Stochastic Optimization. KDD, 2014.

[18] Shai Shalev-Shwartz et al. Accelerated Mini-Batch Stochastic Dual Coordinate Ascent. NIPS, 2013.

[19] Francis R. Bach et al. Self-Concordant Analysis for Logistic Regression. arXiv preprint, 2009.

[20] Eric Moulines et al. Non-Strongly-Convex Smooth Stochastic Approximation with Convergence Rate O(1/n). NIPS, 2013.

[21] Sham M. Kakade et al. Competing with the Empirical Risk Minimizer in a Single Pass. COLT, 2014.

[22] Francis R. Bach et al. Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression. Journal of Machine Learning Research, 2013.

[23] Eric Moulines et al. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. NIPS, 2011.

[24] Shai Shalev-Shwartz et al. Stochastic Dual Coordinate Ascent Methods for Regularized Loss. Journal of Machine Learning Research, 2012.

[25] Martin J. Wainwright et al. Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates. Journal of Machine Learning Research, 2013.

[26] Avleen Singh Bijral et al. Mini-Batch Primal and Dual Methods for SVMs. ICML, 2013.

[27] D. Ruppert. Efficient Estimations from a Slowly Convergent Robbins-Monro Process. 1988.

[28] Francis R. Bach et al. From Averaging to Acceleration, There is Only a Step-size. COLT, 2015.

[29] John C. Duchi et al. Asynchronous Stochastic Convex Optimization. arXiv:1508.00882, 2015.

[30] Gideon S. Mann et al. Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models. NIPS, 2009.