Robust Large Margin Deep Neural Networks

This paper studies the generalization error of deep neural networks via their classification margin. Our approach is based on the Jacobian matrix of a deep neural network and can be applied to networks with arbitrary nonlinearities and pooling layers, and to networks with different architectures, such as feedforward networks and residual networks. Our analysis leads to the conclusion that a bounded spectral norm of the network's Jacobian matrix in the neighbourhood of the training samples is crucial for a deep neural network of arbitrary depth and width to generalize well. This is a significant improvement over current bounds in the literature, which imply that the generalization error grows with either the width or the depth of the network. Moreover, the analysis shows that the recently proposed batch normalization and weight normalization reparametrizations enjoy good generalization properties, and it leads to a novel network regularizer based on the network's Jacobian matrix. The analysis is supported by experimental results on the MNIST, CIFAR-10, LaRED, and ImageNet datasets.
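To make the Jacobian-based regularizer concrete, the following is a minimal PyTorch sketch of one plausible form: penalizing the squared Frobenius norm of the input-output Jacobian at the training samples, which upper-bounds the spectral norm the abstract refers to. The toy model, the penalty weight, and the Frobenius-norm surrogate are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Hypothetical small feedforward classifier, used only for illustration.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

def jacobian_frobenius_penalty(model, x):
    """Mean squared Frobenius norm of the input-output Jacobian at x.

    The Frobenius norm upper-bounds the spectral norm, so keeping it small
    also keeps the Jacobian's spectral norm small around the training samples.
    """
    x = x.clone().requires_grad_(True)
    out = model(x)                      # shape: (batch, num_classes)
    penalty = 0.0
    for k in range(out.shape[1]):       # one backward pass per output dimension
        grad_k = torch.autograd.grad(
            out[:, k].sum(), x, create_graph=True, retain_graph=True
        )[0]                            # shape: (batch, input_dim)
        penalty = penalty + grad_k.pow(2).sum(dim=1)
    return penalty.mean()

# Usage: add the penalty to the usual classification loss.
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y) \
       + 1e-2 * jacobian_frobenius_penalty(model, x)   # weight is an assumption
loss.backward()
```

The per-output-dimension loop trades speed for exactness; in practice one could also use a random-projection estimate of the Jacobian norm to keep the cost at a single extra backward pass.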
