Function Norms and Regularization in Deep Networks

Deep neural networks (DNNs) have become increasingly important due to their excellent empirical performance on a wide range of problems. However, regularization is generally achieved by indirect means, largely because of the complex set of functions defined by a network and the difficulty of measuring function complexity. No method in the literature performs additive regularization based on a norm of the function, as is classically considered in statistical learning theory. In this work, we propose sampling-based approximations to weighted function norms as regularizers for deep neural networks. We provide, to the best of our knowledge, the first proof of the NP-hardness of computing function norms of DNNs, motivating the need for an approximate approach. We then derive a generalization bound for functions trained with weighted norms and prove that a natural stochastic optimization strategy minimizes this bound. Finally, we empirically validate the proposed regularization strategies for both convex function sets and DNNs on real-world classification and image segmentation tasks, demonstrating improved performance over weight decay, dropout, and batch normalization. Source code will be released at the time of publication.
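
The core idea admits a compact sketch: the weighted function norm ||f||_mu^2 = integral of |f(x)|^2 dmu(x) is estimated by Monte Carlo sampling from the weighting distribution mu and added to the training loss as a penalty. The snippet below is a minimal illustration of this idea under our own assumptions, not the paper's released implementation; the `sampler` callable, the penalty weight `lam`, and the helper names are hypothetical.

```python
import torch

def weighted_norm_penalty(model, sampler, num_samples=32):
    # Monte Carlo estimate of the weighted L2 function norm
    #   ||f||_mu^2 = E_{x ~ mu}[ ||f(x)||^2 ],
    # where `sampler` draws inputs from the weighting distribution mu
    # (e.g. perturbed training inputs or a generative model).
    x = sampler(num_samples)             # x_i ~ mu, shape (m, ...)
    out = model(x)                       # f(x_i), shape (m, k)
    return out.pow(2).sum(dim=1).mean()  # (1/m) * sum_i ||f(x_i)||^2

def regularized_step(model, loss_fn, optimizer, batch, sampler, lam=1e-3):
    # One optimization step on the empirical risk plus the additive penalty.
    inputs, targets = batch
    optimizer.zero_grad()
    risk = loss_fn(model(inputs), targets)
    penalty = weighted_norm_penalty(model, sampler)
    loss = risk + lam * penalty
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the penalty is a plain additive term, it composes with any loss and optimizer; fresh samples drawn at each step make the procedure a stochastic approximation of the weighted norm rather than an exact computation, which is the point given the NP-hardness result.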
