Lookbehind Optimizer: k steps back, 1 step forward

The Lookahead optimizer improves the training stability of deep neural networks by maintaining a set of fast weights that "look ahead" to guide the descent direction. Here, we combine this idea with sharpness-aware minimization (SAM) to stabilize its multi-step variant and improve the loss-sharpness trade-off. We propose Lookbehind, which computes $k$ gradient ascent steps ("looking behind") at each iteration and combines the resulting gradients to bias the descent step toward flatter minima. We apply Lookbehind on top of two popular sharpness-aware training methods -- SAM and adaptive SAM (ASAM) -- and show that our approach yields benefits across a variety of tasks and training regimes. In particular, we observe increased generalization performance, greater robustness against noisy weights, and higher tolerance to catastrophic forgetting in lifelong learning settings.
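
To make the update concrete, below is a minimal PyTorch sketch of one way to realize a Lookbehind-style step, based only on the description above: perform $k$ successive SAM-style ascent steps, average the descent gradients collected along the way, apply a single descent step from the original weights, and then interpolate toward the result with Lookahead-style slow weights. The hyperparameter names (`k`, `rho`, `alpha`) and the exact averaging and interpolation scheme are assumptions for illustration, not the authors' reference implementation.

```python
# Minimal sketch (assumed, not the authors' code): one Lookbehind-style update
# combining multi-step SAM ascent with a Lookahead-style slow-weight update.
import torch

def lookbehind_step(model, loss_fn, data, target, base_opt,
                    k=5, rho=0.05, alpha=0.5):
    params = list(model.parameters())
    # Keep a copy of the current ("slow") weights to restore before descending
    # and to interpolate against afterwards.
    slow = [p.detach().clone() for p in params]
    avg_grad = [torch.zeros_like(p) for p in params]

    for _ in range(k):
        # Ascent step ("looking behind"): perturb the weights toward higher loss.
        loss = loss_fn(model(data), target)
        grads = torch.autograd.grad(loss, params)
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        scale = rho / (grad_norm + 1e-12)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p.add_(g, alpha=scale)

        # Descent gradient at the perturbed point, accumulated over the k steps.
        loss = loss_fn(model(data), target)
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for a, g in zip(avg_grad, grads):
                a.add_(g, alpha=1.0 / k)

    # Restore the original weights and descend along the averaged gradient.
    with torch.no_grad():
        for p, s in zip(params, slow):
            p.copy_(s)
    for p, a in zip(params, avg_grad):
        p.grad = a
    base_opt.step()
    base_opt.zero_grad(set_to_none=True)

    # Lookahead-style interpolation between the slow and updated weights
    # (assumed here; the paper's exact slow/fast schedule may differ).
    with torch.no_grad():
        for p, s in zip(params, slow):
            p.copy_(s + alpha * (p - s))
```

In this sketch, setting `k=1` reduces to a single SAM-like step followed by a slow-weight interpolation, while larger `k` trades extra forward/backward passes for a descent direction averaged over a longer ascent trajectory.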
