Statistical inference using SGD

We present a novel method for frequentist statistical inference in $M$-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. From a practical perspective, our SGD-based inference procedure is a first-order method, and is well-suited for large-scale problems. To show its merits, we apply it to both synthetic and real datasets, and demonstrate that its accuracy is comparable to that of classical statistical methods, while requiring potentially far less computation.
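The core idea above can be illustrated with a minimal sketch: run fixed-step-size SGD on a synthetic least-squares problem, average the iterates after a burn-in, and use the spread across independent averaged runs to form a normal-approximation confidence interval. This is a hypothetical toy illustration, not the paper's exact procedure (the problem, step size, burn-in, and replication scheme are all assumptions made for the example).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression: y = X @ theta_star + noise.
n, d = 10_000, 5
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_star + rng.normal(size=n)

def sgd_average(X, y, eta=0.01, steps=5_000, burn_in=1_000, seed=None):
    """One fixed-step-size SGD run; return the average of post-burn-in iterates."""
    run_rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    total = np.zeros(d)
    for t in range(steps):
        i = run_rng.integers(n)
        grad = (X[i] @ theta - y[i]) * X[i]  # single-sample least-squares gradient
        theta -= eta * grad
        if t >= burn_in:
            total += theta
    return total / (steps - burn_in)

# Independent averaged-SGD replicates; their empirical spread yields a
# simple normal-approximation confidence interval for each coordinate.
reps = np.array([sgd_average(X, y, seed=s) for s in range(20)])
est = reps.mean(axis=0)
se = reps.std(axis=0, ddof=1) / np.sqrt(len(reps))
lo, hi = est - 1.96 * se, est + 1.96 * se
```

Note that each replicate here is a fresh SGD run on the same data; the paper's analysis instead treats the averaged iterates of a single fixed-step chain, whose stationary fluctuations (an Ornstein-Uhlenbeck-type limit) justify the asymptotic normality after scaling.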
