Gradient Regularization Improves Accuracy of Discriminative Models

Regularizing the norm of the gradient of a neural network's output with respect to its inputs is a powerful technique, first proposed by Drucker & LeCun (1991), who named it Double Backpropagation. The idea has been independently rediscovered several times since then, most often with the goal of making models robust against adversarial examples. This paper presents evidence that gradient regularization can consistently and significantly improve classification accuracy on vision tasks, especially when the amount of training data is small. We introduce our regularizers as members of a broader class of Jacobian-based regularizers and compare them theoretically and empirically. A natural objection to minimizing the gradient norm at the training points is that a locally optimal solution, in which the model has small gradients at the training points, may still change sharply in other regions of the input space. We demonstrate through experiments on real and synthetic tasks that stochastic gradient descent does not find these locally optimal but globally unproductive solutions; instead, it is driven toward solutions that generalize well.
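
To make the mechanics concrete, the following is a minimal PyTorch sketch of one member of this family of regularizers: the squared L2 norm of the gradient of the cross-entropy loss with respect to the inputs, added to the usual objective. The function name, the penalty weight `reg_weight`, and the choice of penalizing the loss gradient rather than the full Jacobian of the outputs are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gradient_regularized_loss(model, x, y, reg_weight=0.01):
    """Cross-entropy plus the squared L2 norm of the input gradient of the loss."""
    x = x.clone().requires_grad_(True)  # track gradients with respect to the inputs
    ce = F.cross_entropy(model(x), y)
    # create_graph=True keeps the graph so the penalty itself is differentiable;
    # backpropagating through it is the "double backpropagation" step.
    (grad_x,) = torch.autograd.grad(ce, x, create_graph=True)
    penalty = grad_x.flatten(start_dim=1).pow(2).sum(dim=1).mean()
    return ce + reg_weight * penalty

# Usage inside a standard training loop:
#   loss = gradient_regularized_loss(model, images, labels)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```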

[1] Andrew Slavin Ross, et al. Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients, 2017, AAAI.

[2] Razvan Pascanu, et al. Sobolev Training for Neural Networks, 2017, NIPS.

[3] G. Wahba. Spline models for observational data, 1990.

[4] Yann LeCun, et al. Tangent Prop - A Formalism for Specifying Selected Invariances in an Adaptive Network, 1991, NIPS.

[5] Raja Giryes, et al. Improving DNN Robustness to Adversarial Attacks using Jacobian Regularization, 2018, ECCV.

[6] L. Györfi, et al. A Distribution-Free Theory of Nonparametric Regression (Springer Series in Statistics), 2002.

[7] Joan Bruna, et al. Intriguing properties of neural networks, 2013, ICLR.

[8] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2016, CVPR.

[9] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[10] Aaron C. Courville, et al. Improved Training of Wasserstein GANs, 2017, NIPS.

[11] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[12] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[13] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[14] Jascha Sohl-Dickstein, et al. Sensitivity and Generalization in Neural Networks: an Empirical Study, 2018, ICLR.

[15] Guillermo Sapiro, et al. Robust Large Margin Deep Neural Networks, 2016, IEEE Transactions on Signal Processing.

[16] Geoffrey E. Hinton, et al. Regularizing Neural Networks by Penalizing Confident Output Distributions, 2017, ICLR.

[17] Lorenzo Rosasco, et al. Nonparametric sparsity and regularization, 2012, J. Mach. Learn. Res.

[18] Daniel Kifer, et al. Unifying Adversarial Training Algorithms with Data Gradient Regularization, 2017, Neural Computation.

[19] Luca Rigazio, et al. Towards Deep Neural Network Architectures Robust to Adversarial Examples, 2014, ICLR.

[20] Y. Le Cun, et al. Double backpropagation increasing generalization performance, 1991, IJCNN.

[21] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, CVPR.

[22] Adam Krzyzak, et al. A Distribution-Free Theory of Nonparametric Regression, 2002, Springer Series in Statistics.

[23] Yuichi Yoshida, et al. Spectral Norm Regularization for Improving the Generalizability of Deep Learning, 2017, arXiv.