Implicit stochastic gradient descent

Stochastic optimization procedures, such as stochastic gradient descent, have gained popularity for parameter estimation from large data sets. However, standard stochastic optimization procedures cannot effectively combine numerical stability with statistical and computational efficiency. Here, we introduce an implicit stochastic gradient descent procedure, the iterates of which are implicitly defined. Intuitively, implicit iterates shrink the standard iterates. The amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed in practice, thus increasing stability without increasing the computational burden. When combined with averaging, the proposed procedure achieves statistical efficiency as well. We derive non-asymptotic bounds and characterize the asymptotic distribution of implicit procedures. Our analysis also reveals the asymptotic variance of a number of existing procedures. We demonstrate implicit stochastic gradient descent by further developing theory for generalized linear models, Cox proportional hazards models, and M-estimation problems, and by carrying out extensive experiments. Our results suggest that the implicit stochastic gradient descent procedure is poised to become the workhorse of estimation with large data sets.
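The shrinkage idea above can be illustrated on least-squares regression, where the implicit update has a well-known closed form: solving the implicitly defined iterate is equivalent to shrinking the explicit gradient step by 1/(1 + γₙ‖xₙ‖²), the normalized-LMS factor. The sketch below is illustrative only (the function name, step-size schedule, and toy setup are our own choices, not the paper's):

```python
import numpy as np

def implicit_sgd_linreg(X, y, lr0=1.0):
    """Implicit SGD for least-squares regression (illustrative sketch).

    For squared loss, the implicit iterate
        theta_n = theta_{n-1} + g_n * (y_n - x_n' theta_n) * x_n
    can be solved in closed form:
        theta_n = theta_{n-1}
                  + g_n / (1 + g_n * ||x_n||^2) * (y_n - x_n' theta_{n-1}) * x_n,
    i.e. the explicit update shrunk by 1 / (1 + g_n * ||x_n||^2).
    The shrinkage keeps the procedure stable even for large g_n.
    """
    n, p = X.shape
    theta = np.zeros(p)
    for i in range(n):
        x_i, y_i = X[i], y[i]
        g = lr0 / (1.0 + i)           # Robbins-Monro step size (assumed schedule)
        resid = y_i - x_i @ theta      # residual at the current iterate
        theta = theta + (g / (1.0 + g * (x_i @ x_i))) * resid * x_i
    return theta
```

Note that no matrix inversion is needed: the implicit equation collapses to a scalar rescaling of the explicit step, which is the sense in which the Fisher-information-dependent shrinkage comes at no extra computational cost.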
