Lookbehind Optimizer: k steps back, 1 step forward

The Lookahead optimizer improves the training stability of deep neural networks by maintaining a set of fast weights that "look ahead" to guide the descent direction. Here, we combine this idea with sharpness-aware minimization (SAM) to stabilize its multi-step variant and improve the loss-sharpness trade-off. We propose Lookbehind, which computes $k$ gradient ascent steps ("looking behind") at each iteration and combines the resulting gradients to bias the descent step toward flatter minima. We apply Lookbehind on top of two popular sharpness-aware training methods -- SAM and adaptive SAM (ASAM) -- and show that our approach yields benefits across a variety of tasks and training regimes. In particular, we observe increased generalization performance, greater robustness against noisy weights, and higher tolerance to catastrophic forgetting in lifelong learning settings.
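
To make the update concrete, below is a minimal PyTorch sketch of one way to realize a Lookbehind-style step, based only on the description above: perform $k$ successive SAM-style ascent steps, average the descent gradients collected along the way, apply a single descent step from the original weights, and then interpolate toward the result with Lookahead-style slow weights. The hyperparameter names (`k`, `rho`, `alpha`) and the exact averaging and interpolation scheme are assumptions for illustration, not the authors' reference implementation.

```python
# Minimal sketch (assumed, not the authors' code): one Lookbehind-style update
# combining multi-step SAM ascent with a Lookahead-style slow-weight update.
import torch

def lookbehind_step(model, loss_fn, data, target, base_opt,
                    k=5, rho=0.05, alpha=0.5):
    params = list(model.parameters())
    # Keep a copy of the current ("slow") weights to restore before descending
    # and to interpolate against afterwards.
    slow = [p.detach().clone() for p in params]
    avg_grad = [torch.zeros_like(p) for p in params]

    for _ in range(k):
        # Ascent step ("looking behind"): perturb the weights toward higher loss.
        loss = loss_fn(model(data), target)
        grads = torch.autograd.grad(loss, params)
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        scale = rho / (grad_norm + 1e-12)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p.add_(g, alpha=scale)

        # Descent gradient at the perturbed point, accumulated over the k steps.
        loss = loss_fn(model(data), target)
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for a, g in zip(avg_grad, grads):
                a.add_(g, alpha=1.0 / k)

    # Restore the original weights and descend along the averaged gradient.
    with torch.no_grad():
        for p, s in zip(params, slow):
            p.copy_(s)
    for p, a in zip(params, avg_grad):
        p.grad = a
    base_opt.step()
    base_opt.zero_grad(set_to_none=True)

    # Lookahead-style interpolation between the slow and updated weights
    # (assumed here; the paper's exact slow/fast schedule may differ).
    with torch.no_grad():
        for p, s in zip(params, slow):
            p.copy_(s + alpha * (p - s))
```

In this sketch, setting `k=1` reduces to a single SAM-like step followed by a slow-weight interpolation, while larger `k` trades extra forward/backward passes for a descent direction averaged over a longer ascent trajectory.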
