Efficient Sharpness-aware Minimization for Improved Training of Neural Networks

Overparametrized Deep Neural Networks (DNNs) often achieve astounding performance, but can suffer from severe generalization error. Recently, Foret et al. (2020) established a relation between the sharpness of the loss landscape and the generalization error, and proposed the Sharpness-Aware Minimizer (SAM) to mitigate this degradation in generalization. Unfortunately, SAM's computational cost is roughly double that of base optimizers such as Stochastic Gradient Descent (SGD). This paper thus proposes the Efficient Sharpness-Aware Minimizer (ESAM), which boosts SAM's efficiency at no cost to its generalization performance. ESAM comprises two novel and efficient training strategies: Stochastic Weight Perturbation (SWP) and Sharpness-Sensitive Data Selection (SDS). In the former, the sharpness measure is approximated by perturbing only a stochastically chosen subset of the weights in each iteration; in the latter, the SAM loss is optimized using only a judiciously selected subset of the data that is most sensitive to the sharpness. We provide theoretical explanations for why these strategies perform well. We also show, via extensive experiments on the CIFAR and ImageNet datasets, that ESAM reduces SAM's extra computational overhead over base optimizers from 100% to 40%, while test accuracies are preserved or even improved. Our code is available at https://github.com/dydjw9/Efficient_SAM.
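To make the two strategies concrete, the sketch below shows one ESAM-style training step in PyTorch. Recall that SAM optimizes min_w max_{||e||_2 <= rho} L(w + e) with two forward-backward passes per step: one to compute the ascent perturbation e, and one to compute the descent gradient at the perturbed weights w + e. The sketch cheapens both passes. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `esam_step` and the hyperparameters `beta` (fraction of weights perturbed) and `gamma` (fraction of samples retained) are illustrative, and the repository linked above contains the reference code.

```python
# A hedged sketch of one ESAM-style step, assuming PyTorch. `rho` is
# SAM's perturbation radius; `beta` and `gamma` are illustrative knobs
# for SWP and SDS respectively (not the authors' exact API).
import torch
import torch.nn.functional as F

def esam_step(model, x, y, optimizer, rho=0.05, beta=0.5, gamma=0.5):
    # First forward-backward pass: per-sample losses and gradients
    # at the current weights w.
    clean_loss = F.cross_entropy(model(x), y, reduction="none")
    clean_loss.mean().backward()

    # Global gradient norm, used to scale the ascent step to radius rho.
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))

    # Stochastic Weight Perturbation (SWP): perturb only a random
    # beta-fraction of the weights toward higher loss, rescaled by
    # 1/beta so the perturbation is unbiased in expectation, instead
    # of perturbing every weight as vanilla SAM does.
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            mask = (torch.rand_like(p) < beta).float()
            e = rho * mask * p.grad / (grad_norm + 1e-12) / beta
            p.add_(e)
            perturbations.append((p, e))
    optimizer.zero_grad()

    # Sharpness-Sensitive Data Selection (SDS): at the perturbed
    # weights w + e, keep the gamma-fraction of samples whose loss
    # increased the most, i.e. those most sensitive to the sharpness.
    with torch.no_grad():
        sensitivity = (F.cross_entropy(model(x), y, reduction="none")
                       - clean_loss.detach())
    k = max(1, int(gamma * x.size(0)))
    idx = sensitivity.topk(k).indices

    # Second forward-backward pass on the selected subset only; its
    # gradient approximates the SAM gradient at w + e.
    F.cross_entropy(model(x[idx]), y[idx]).backward()

    # Undo the perturbation, then apply the base optimizer update.
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
```

Since backpropagation cost grows roughly linearly with batch size, running the second pass on only a gamma-fraction of the batch is where most of this sketch's savings over vanilla SAM come from; SWP further trims the cost of constructing and applying the perturbation.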

References

[1] Roderick Karlemstrand et al. Using Early-Learning Regularization to Classify Real-World Noisy Data. arXiv, 2021.

[2] Jorge Nocedal et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, 2017.

[3] Razvan Pascanu et al. Sharp Minima Can Generalize For Deep Nets. ICML, 2017.

[4] Li Fei-Fei et al. ImageNet: A large-scale hierarchical image database. CVPR, 2009.

[5] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[6] Gintare Karolina Dziugaite et al. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. UAI, 2017.

[7] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). 1983.

[8] Richong Zhang et al. Regularizing Neural Networks via Adversarial Model Perturbation. CVPR, 2021.

[9] Rong Jin et al. Scaled ReLU Matters for Training Vision Transformers. AAAI, 2022.

[10] Yu Liu et al. Gradient Harmonized Single-stage Detector. AAAI, 2019.

[11] Quoc V. Le et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. ICML, 2021.

[12] Cho-Jui Hsieh et al. When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations. arXiv, 2021.

[13] Nathan Srebro et al. Exploring Generalization in Deep Learning. NIPS, 2017.

[14] Pierre Foret et al. Sharpness-Aware Minimization for Efficiently Improving Generalization. arXiv, 2020.

[15] Geoffrey E. Hinton et al. Lookahead Optimizer: k steps forward, 1 step back. NeurIPS, 2019.

[16] Seyed-Mohsen Moosavi-Dezfooli et al. Robustness via Curvature Regularization, and Vice Versa. CVPR, 2019.

[17] Jürgen Schmidhuber et al. Simplifying Neural Nets by Discovering Flat Minima. NIPS, 1994.

[18] Hossein Mobahi et al. Fantastic Generalization Measures and Where to Find Them. ICLR, 2020.

[19] Stefano Soatto et al. Entropy-SGD: biasing gradient descent into wide valleys. ICLR, 2017.

[20] Jimmy Ba et al. Adam: A Method for Stochastic Optimization. ICLR, 2015.

[21] Jungmin Kwon et al. ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks. ICML, 2021.

[22] Tengyu Ma et al. Label Noise SGD Provably Prefers Flat Global Minimizers. NeurIPS, 2021.

[23] Quoc V. Le et al. Meta Pseudo Labels. CVPR, 2021.

[24] Yisen Wang et al. Adversarial Weight Perturbation Helps Robust Generalization. NeurIPS, 2020.

[25] Junmo Kim et al. Deep Pyramidal Residual Networks. CVPR, 2017.

[26] Tao Lin et al. On the Loss Landscape of Adversarial Training: Identifying Challenges and How to Overcome Them. NeurIPS, 2020.

[27] Samy Bengio et al. Understanding deep learning (still) requires rethinking generalization. Commun. ACM, 2021.