BN-invariant Sharpness Regularizes the Training Model to Better Generalization

It is widely believed that flatter minima generalize better. However, it has been pointed out that the usual definitions of sharpness, which consider either the maximum or the integral of the loss over a δ-ball of parameters around a minimum, cannot give a consistent measure for scale-invariant neural networks, e.g., networks with batch normalization layers. In this paper, we first propose a measure of sharpness, BN-Sharpness, which gives a consistent value for networks that are equivalent under batch normalization (BN). It achieves scale invariance by tying the diameter of the integration region to the scale of the parameters. We then present a computationally efficient way to approximate BN-Sharpness, namely a one-dimensional integral along the "sharpest" direction. Furthermore, we use BN-Sharpness to regularize training and design an algorithm to minimize the regularized objective. Our algorithm achieves considerably better performance than vanilla SGD across various experimental settings.
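To make the idea concrete, below is a minimal PyTorch sketch of sharpness-regularized training in the spirit of the abstract. The paper's BN-Sharpness is a one-dimensional integral along the sharpest direction with a radius tied to the parameter scale; the sketch replaces that integral with a single-step estimate (the loss increase along the local gradient direction, with a perturbation radius proportional to the parameter norm) purely for illustration. The function name `sharpness_regularized_step` and the hyperparameters `rho` and `lam` are assumptions, not the authors' implementation.

```python
# Sketch only: approximates an objective L(w) + lam * (L(w + eps) - L(w)),
# where eps follows the local ascent direction and its length scales with ||w||
# so the penalty is unaffected by BN-induced rescaling of the weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


def sharpness_regularized_step(model, loss_fn, optimizer, x, y, rho=0.05, lam=0.1):
    params = [p for p in model.parameters() if p.requires_grad]

    # 1) Gradient at the current parameters w.
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [p.grad.detach().clone() for p in params]

    # 2) Ascent step with a scale-invariant radius (proportional to ||w||).
    param_norm = torch.sqrt(sum((p.detach() ** 2).sum() for p in params))
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    step = rho * param_norm / grad_norm
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(step * g)

    # 3) Gradient at the perturbed parameters w + eps.
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    perturbed_grads = [p.grad.detach().clone() for p in params]

    # 4) Restore w and step on (1 - lam) * grad L(w) + lam * grad L(w + eps).
    with torch.no_grad():
        for p, g, gp in zip(params, grads, perturbed_grads):
            p.sub_(step * g)
            p.grad = (1.0 - lam) * g + lam * gp
    optimizer.step()
    return loss.item()


# Illustrative usage on a toy batch (model and hyperparameters are assumptions):
model = nn.Sequential(nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
sharpness_regularized_step(model, F.cross_entropy, opt, x, y)
```

The two backward passes avoid differentiating through the in-place perturbation of the parameters; the combined gradient approximates the gradient of the regularized objective when the dependence of the perturbation on w is ignored.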
