Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order Information

We present a novel adaptive optimization algorithm for large-scale machine learning problems. Equipped with a low-cost estimate of local curvature and Lipschitz smoothness, our method dynamically adapts the search direction and step size. The search direction contains gradient information preconditioned by a well-scaled diagonal matrix that captures the local curvature, and the step size is updated automatically without introducing an extra hyperparameter, removing the tedious task of learning-rate tuning. We provide convergence guarantees for a comprehensive collection of optimization problems, including convex, strongly convex, and nonconvex objectives, in both deterministic and stochastic regimes. We also conduct an extensive empirical evaluation on standard machine learning problems, demonstrating our algorithm's versatility and its strong performance compared to other state-of-the-art first-order and second-order methods.
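The abstract points to two adaptive ingredients: a diagonal, curvature-based preconditioner for the search direction and a step size that is set automatically from local smoothness. As a rough illustration only (not the paper's actual algorithm), the sketch below combines a Hutchinson-style diagonal Hessian estimate with a local Lipschitz-constant estimate on a toy quadratic; the probe count, stability floors, and step-size rule are all illustrative assumptions.

```python
import numpy as np

# Toy strongly convex quadratic: f(x) = 0.5 * x^T A x - b^T x
rng = np.random.default_rng(0)
Q = rng.standard_normal((50, 50))
A = Q.T @ Q + np.eye(50)          # symmetric positive definite
b = rng.standard_normal(50)

grad = lambda x: A @ x - b        # gradient of f
hvp = lambda v: A @ v             # Hessian-vector product (exact for a quadratic)

def hutchinson_diag(hvp, dim, n_samples=10, rng=rng):
    """Estimate diag(H) from Hessian-vector products with Rademacher probes."""
    d = np.zeros(dim)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=dim)
        d += z * hvp(z)
    return d / n_samples

x = np.zeros(50)
x_prev, p_prev = None, None
eta = 1e-3                         # conservative initial step size
for k in range(200):
    g = grad(x)
    # Diagonal preconditioner from the estimated curvature (floored for stability).
    D = np.maximum(np.abs(hutchinson_diag(hvp, x.size)), 1e-8)
    p = g / D                      # preconditioned search direction
    if x_prev is not None:
        # Local smoothness estimate of the preconditioned gradient map
        # sets the step size automatically, with no learning-rate tuning.
        L_hat = np.linalg.norm(p - p_prev) / (np.linalg.norm(x - x_prev) + 1e-12)
        eta = 1.0 / (2.0 * L_hat + 1e-12)
    x_prev, p_prev = x.copy(), p.copy()
    x = x - eta * p                # preconditioned gradient step

print("final gradient norm:", np.linalg.norm(grad(x)))
```

Both quantities are re-estimated every iteration, which is the sense in which such a scheme is "doubly adaptive": the preconditioner tracks local curvature while the step size tracks local smoothness.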
