Statistical analysis of stochastic gradient methods for generalized linear models

We study the statistical properties of stochastic gradient descent (SGD) with explicit and implicit updates for fitting generalized linear models (GLMs). We begin by developing a computationally efficient algorithm for implicit SGD learning of GLMs. We then derive exact formulas for the bias and variance of both updates, which lead to two key observations about their comparative statistical properties. First, in small samples, the estimates from the implicit procedure are more biased than those from the explicit procedure, but their empirical variance is smaller and they are more robust to misspecification of the learning rate. Second, the two procedures are statistically identical in the limit: both are unbiased, converge at the same rate, and have the same asymptotic variance. Our experiments confirm the theory and, more broadly, suggest that the implicit procedure can be a competitive choice for fitting large-scale models, especially when robustness is a concern.
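
To make the contrast between the two updates concrete, the sketch below implements one explicit and one implicit SGD step for a logistic-link GLM. This is a minimal illustration, not the paper's reference implementation: it assumes the implicit step can be carried out by noting that the update stays on the line spanned by the covariate x, so the vector fixed-point equation collapses to a scalar root-finding problem, solved here with scipy.optimize.brentq; all function names below are ours.

    import numpy as np
    from scipy.optimize import brentq

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def explicit_sgd_step(theta, x, y, lr):
        # Explicit SGD: the gradient is evaluated at the current iterate.
        return theta + lr * (y - sigmoid(x @ theta)) * x

    def implicit_sgd_step(theta, x, y, lr):
        # Implicit SGD: the gradient is evaluated at the next iterate.
        # Because the update direction is still x, the step reduces to finding
        # the scalar xi solving  xi = y - sigmoid(x'theta + lr * xi * ||x||^2).
        norm2 = float(x @ x)
        r = y - sigmoid(x @ theta)   # explicit residual; it brackets the root
        if r == 0.0 or norm2 == 0.0:
            return theta
        def f(xi):
            return xi - (y - sigmoid(x @ theta + lr * xi * norm2))
        xi = brentq(f, min(0.0, r), max(0.0, r))
        return theta + lr * xi * x

A short simulation driver, again purely illustrative, runs both procedures on the same stream of logistic-regression data:

    rng = np.random.default_rng(0)
    theta_true = np.array([1.0, -2.0, 0.5])
    theta_explicit = np.zeros(3)
    theta_implicit = np.zeros(3)
    for n in range(1, 50001):
        x = rng.normal(size=3)
        y = rng.binomial(1, sigmoid(x @ theta_true))
        lr = 2.0 / n   # Robbins-Monro style learning rate
        theta_explicit = explicit_sgd_step(theta_explicit, x, y, lr)
        theta_implicit = implicit_sgd_step(theta_implicit, x, y, lr)

By construction the scalar xi satisfies |xi| <= |y - sigmoid(x'theta)|, so the implicit step is a shrunken version of the explicit one; this shrinkage is one intuition for the robustness to learning-rate misspecification noted above.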
