Stochastic Function Norm Regularization of Deep Networks

Deep neural networks have had an enormous impact on image analysis. State-of-the-art training methods, based on weight decay and DropOut, result in impressive performance when a very large training set is available, but they are prone to severe overfitting on small data sets. Indeed, the available regularization methods control the complexity of the network function only indirectly. In this paper, we study the feasibility of directly using the $L_2$ function norm for regularization. We propose two methods for integrating this regularizer into stochastic backpropagation and study their convergence. Finally, we show that they outperform state-of-the-art methods in the low-sample regime on benchmark datasets (MNIST and CIFAR10), with especially clear gains when the data lie on a low-dimensional manifold. Source code of the method can be found at \url{this https URL}.
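To make the idea concrete, the sketch below shows one plausible way to add a stochastically estimated $L_2$ function-norm penalty to an ordinary training step: the squared norm $\int \|f(x)\|^2 \, d\mu(x)$ is approximated by a Monte Carlo average over points drawn from a sampling distribution $\mu$. This is a minimal illustration, not the paper's implementation: the choice of $\mu$ (a standard Gaussian here), the penalty weight `lam`, the helper `function_norm_penalty`, and the toy model are all assumptions introduced for clarity.

```python
import torch
import torch.nn as nn

def function_norm_penalty(model, sampler, num_samples=64):
    """Monte Carlo estimate of the squared L2 function norm
    E_{x ~ mu}[ ||f(x)||^2 ], where mu is the distribution
    implemented by `sampler` (an illustrative choice)."""
    x = sampler(num_samples)             # draw points x ~ mu
    out = model(x)                       # evaluate f(x)
    return out.pow(2).sum(dim=1).mean()  # average squared output norm

# Hypothetical usage inside a single SGD step:
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
sampler = lambda m: torch.randn(m, 784)  # illustrative mu: standard Gaussian in input space
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs, targets = torch.randn(32, 784), torch.randint(0, 10, (32,))
lam = 1e-3                               # illustrative regularization weight
loss = criterion(model(inputs), targets) + lam * function_norm_penalty(model, sampler)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this sketch the regularizer is just another differentiable term in the loss, so standard backpropagation handles it; only the sampling distribution and the number of Monte Carlo samples need to be chosen.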
