Scalable estimation strategies based on stochastic approximations: classical results and new insights

Estimation with large amounts of data can be facilitated by stochastic gradient methods, in which model parameters are updated sequentially using small batches of data at each step. Here, we review early work and modern results that illustrate the statistical properties of these methods, including convergence rates, stability, and asymptotic bias and variance. We then survey modern applications where these methods are useful, ranging from an online version of the expectation–maximization (EM) algorithm to deep learning. In light of these results, we argue that stochastic gradient methods are poised to become benchmark principled estimation procedures for large datasets, especially those in the family of stable proximal methods, such as implicit stochastic gradient descent.
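
To make the contrast concrete, the sketch below compares the explicit update, theta_n = theta_{n-1} + gamma_n (y_n - x_n' theta_{n-1}) x_n, with the implicit (proximal) update, theta_n = theta_{n-1} + gamma_n (y_n - x_n' theta_n) x_n, on a least-squares problem, where the implicit fixed point has a closed form. This is a minimal illustration under our own assumptions (toy Gaussian data, a 1/n step-size schedule, helper names of our choosing); it is not code from the works reviewed here.

```python
import numpy as np

def explicit_update(theta, x, y, gamma):
    # Explicit SGD: theta <- theta + gamma * (y - x.theta) * x
    return theta + gamma * (y - x @ theta) * x

def implicit_update(theta, x, y, gamma):
    # Implicit SGD solves theta_new = theta + gamma * (y - x.theta_new) * x.
    # For least squares the residual at the fixed point is the current
    # residual shrunk by 1 / (1 + gamma * ||x||^2), so the update cannot
    # overshoot regardless of how large gamma is.
    shrunk_residual = (y - x @ theta) / (1.0 + gamma * (x @ x))
    return theta + gamma * shrunk_residual * x

rng = np.random.default_rng(0)
n, p = 10_000, 5
theta_star = rng.normal(size=p)            # ground-truth parameters
X = rng.normal(size=(n, p))
Y = X @ theta_star + 0.5 * rng.normal(size=n)

theta_exp = np.zeros(p)
theta_imp = np.zeros(p)
for i in range(n):
    gamma = 1.0 / (1.0 + i)                # Robbins-Monro step size, O(1/n)
    theta_exp = explicit_update(theta_exp, X[i], Y[i], gamma)
    theta_imp = implicit_update(theta_imp, X[i], Y[i], gamma)

print("explicit SGD error:", np.linalg.norm(theta_exp - theta_star))
print("implicit SGD error:", np.linalg.norm(theta_imp - theta_star))
```

With a large initial step size, the explicit iterate can transiently overshoot before the decaying schedule stabilizes it, whereas the implicit iterate shrinks its own residual and remains bounded for any gamma; this is the stability property that motivates the proximal family highlighted above.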
