Convergence diagnostics for stochastic gradient descent with constant step size

Iterative procedures in stochastic optimization typically consist of a transient phase and a stationary phase. During the transient phase the procedure converges towards a region of interest, and during the stationary phase it oscillates within that convergence region, commonly around a single point. In this paper, we develop a statistical diagnostic test to detect this phase transition in the context of stochastic gradient descent with constant step size. We present theoretical and experimental results suggesting that the diagnostic behaves as intended, and that the region where the diagnostic is activated coincides with the convergence region. For a class of loss functions, we derive a closed-form solution describing this region, and support the theoretical result with simulated experiments. Finally, we suggest an application that speeds up convergence of stochastic gradient descent by halving the learning rate each time convergence is detected. This leads to remarkable speed gains that are empirically comparable to those of state-of-the-art procedures.
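The halving scheme described above can be sketched as follows. The abstract does not state the test statistic, so this sketch assumes a Pflug-type diagnostic: it keeps a running sum of inner products of successive stochastic gradients and declares stationarity once that sum turns negative after a burn-in period. The function names, the `burnin` parameter, and the simulated least-squares problem are all hypothetical choices made for illustration, not the authors' implementation.

```python
import random


def make_least_squares_grad(theta_star, noise_sd, rng):
    """Return a function giving one stochastic gradient of the squared loss
    0.5 * (x . theta - y)^2 on a freshly simulated data point (x, y)."""
    def grad(theta):
        x = [rng.gauss(0.0, 1.0) for _ in theta_star]
        y = sum(xi * ti for xi, ti in zip(x, theta_star)) + rng.gauss(0.0, noise_sd)
        resid = sum(xi * ti for xi, ti in zip(x, theta)) - y
        return [resid * xi for xi in x]
    return grad


def sgd_halving(grad, theta0, lr, n_iter, burnin=100):
    """Constant-step SGD that halves the step size whenever the
    (assumed) inner-product diagnostic signals stationarity."""
    theta = list(theta0)
    prev_grad = None
    running_sum = 0.0        # sum of inner products of successive gradients
    steps_since_reset = 0
    halvings = 0
    for _ in range(n_iter):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        if prev_grad is not None:
            running_sum += sum(a * b for a, b in zip(prev_grad, g))
        prev_grad = g
        steps_since_reset += 1
        # Assumed rule: a negative running sum after burn-in suggests the
        # iterates are oscillating rather than still making progress.
        if steps_since_reset > burnin and running_sum < 0:
            lr /= 2
            halvings += 1
            running_sum = 0.0
            prev_grad = None
            steps_since_reset = 0
    return theta, lr, halvings


# Simulated example (all values are illustrative assumptions).
rng = random.Random(0)
theta_star = [1.0, -2.0]
grad = make_least_squares_grad(theta_star, noise_sd=1.0, rng=rng)
theta, final_lr, halvings = sgd_halving(grad, [0.0, 0.0], lr=0.1, n_iter=20000)
print(halvings, final_lr, theta)
```

The intuition behind the assumed statistic is that successive gradients tend to point in similar directions during the transient phase, giving positive inner products, and tend to point in opposing directions once the iterates oscillate around the optimum. The burn-in guards against firing on the first few noisy inner products after each reset; its value here is arbitrary.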
