Amortized Proximal Optimization

We propose Amortized Proximal Optimization (APO), a framework for online meta-optimization of the parameters that govern optimization. We first interpret various existing neural network optimizers as approximate stochastic proximal point methods that trade off the current-batch loss against proximity terms in both function space and weight space. The idea behind APO is to amortize the minimization of this proximal point objective by meta-learning the parameters of an update rule. We show how APO can be used to adapt a learning rate or a structured preconditioning matrix. Under appropriate assumptions, APO recovers existing optimizers such as natural gradient descent and KFAC. It incurs low computational overhead and avoids the expensive, numerically sensitive operations required by some second-order optimizers, such as matrix inverses. We empirically test APO for online adaptation of learning rates and structured preconditioning matrices on regression, image reconstruction, image classification, and natural language translation tasks. The learning rate schedules found by APO generally outperform the best fixed learning rates and are competitive with manually tuned decay schedules. Using APO to adapt a structured preconditioning matrix yields optimization performance competitive with second-order methods. Moreover, the absence of matrix inversion provides numerical stability, making APO effective for low-precision training.
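Since the abstract describes the method only at a high level, the snippet below is a minimal illustrative sketch (not the paper's implementation) of how a single scalar learning rate can be adapted online by taking meta-gradient steps on a proximal objective: the current-batch loss at a one-step lookahead, plus simple function-space and weight-space proximity terms. The names `apo_step`, `log_eta`, `lam_f`, `lam_w`, and `meta_opt`, the squared-error form of the proximity terms, and the hyperparameter values are all assumptions for illustration; the code also assumes PyTorch ≥ 2.0 for `torch.func.functional_call`.

```python
# Illustrative sketch of APO-style online learning-rate adaptation (assumed
# names and hyperparameters; not the paper's exact implementation).
import torch

log_eta = torch.zeros((), requires_grad=True)    # meta-parameter: log learning rate
meta_opt = torch.optim.Adam([log_eta], lr=1e-3)  # optimizer for the meta-parameter
lam_f, lam_w = 1.0, 1e-3                         # proximity weights (assumed values)

def apo_step(model, loss_fn, x, y):
    """One APO iteration: meta-update the learning rate, then update the model."""
    names, params = zip(*[(n, p) for n, p in model.named_parameters() if p.requires_grad])

    # Gradient of the current-batch loss at the current weights.
    out = model(x)
    loss = loss_fn(out, y)
    grads = torch.autograd.grad(loss, params)

    # Differentiable one-step lookahead w' = w - eta * g (only eta carries gradient).
    eta = log_eta.exp()
    lookahead = {n: p.detach() - eta * g for n, p, g in zip(names, params, grads)}

    # Proximal objective at the lookahead point: current-batch loss plus
    # function-space and weight-space proximity terms.
    out_new = torch.func.functional_call(model, lookahead, (x,))
    prox = (
        loss_fn(out_new, y)
        + lam_f * (out_new - out.detach()).pow(2).mean()     # function-space proximity
        + lam_w * sum((w - p.detach()).pow(2).sum()          # weight-space proximity
                      for w, p in zip(lookahead.values(), params))
    )

    # Amortized meta-update: descend the proximal objective w.r.t. the learning rate.
    meta_opt.zero_grad()
    prox.backward()
    meta_opt.step()

    # Ordinary SGD step on the model with the freshly adapted learning rate.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(log_eta.exp() * g)
    return loss.item()
```

In the same spirit, the scalar learning rate could be replaced by a structured preconditioner (for example, per-layer or Kronecker-factored matrices applied to the gradient) whose entries are meta-learned through the same proximal objective; that is the setting the abstract compares against second-order methods.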
