Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods

Our goal is to improve variance-reducing stochastic methods through better control variates. We first propose a modification of SVRG that uses the Hessian to track gradients over time, rather than to recondition, increasing the correlation of the control variates and leading to faster theoretical convergence close to the optimum. We then propose accurate and computationally efficient approximations of this Hessian, using both a diagonal and a low-rank matrix. Finally, we demonstrate the effectiveness of our method on a wide range of problems.
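
The idea can be sketched as follows (a minimal illustration, not the authors' reference implementation: the function names, step size, and loop structure are our own illustrative choices). An SVRG-style inner loop keeps the usual unbiased gradient estimate, but the snapshot gradient of the sampled component is corrected by a first-order Taylor term built from that component's Hessian, so the control variate tracks the current gradient more closely as the iterate moves away from the snapshot.

```python
import numpy as np

def hessian_corrected_svrg(grad_i, hess_i, full_grad, full_hess, w0,
                           n, step=0.1, epochs=10, inner=100, rng=None):
    """SVRG-style loop whose control variate is corrected with the
    Hessian of the sampled component (a sketch of the idea described
    in the abstract, under illustrative assumptions)."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = w0.copy()
    for _ in range(epochs):
        w_snap = w.copy()             # snapshot point
        g_snap = full_grad(w_snap)    # full gradient at the snapshot
        H_snap = full_hess(w_snap)    # full Hessian at the snapshot
        for _ in range(inner):
            i = rng.integers(n)
            delta = w - w_snap
            # Control variate: snapshot gradient of component i plus a
            # first-order Taylor correction, which stays correlated with
            # grad_i(w) longer than the plain SVRG control variate.
            cv_i = grad_i(i, w_snap) + hess_i(i, w_snap) @ delta
            cv_mean = g_snap + H_snap @ delta   # expectation of cv_i over i
            g = grad_i(i, w) - cv_i + cv_mean   # unbiased gradient estimate
            w -= step * g
    return w
```

In practice, the component Hessians need not be stored or applied exactly; replacing `hess_i(i, w_snap) @ delta` (and the corresponding full-Hessian product) with a diagonal or low-rank approximation keeps the per-iteration cost close to that of plain SVRG, which is the role of the approximations proposed above.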
