Lipschitz regularized Deep Neural Networks converge and generalize

Generalization of deep neural networks (DNNs) is an open problem which, if solved, could impact the reliability and verification of deep neural network architectures. In this paper, we show that if the usual fidelity term used in training DNNs is augmented by a Lipschitz regularization term, then the networks converge and generalize. Convergence holds in the limit as the number of data points n → ∞, while the network is allowed to grow as needed to fit the data. Two regimes are identified: in the case of clean labels, we prove convergence to the label function, which corresponds to zero loss; in the case of corrupted labels, we prove convergence to a regularized label function, which is the solution of a limiting variational problem. In both cases, a convergence rate is also provided.
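For concreteness, a representative form of such a Lipschitz-regularized empirical loss (a sketch for orientation, not the paper's exact functional, whose precise definition and variational limit are given in the paper) is

\[
  J_n[u] \;=\; \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\bigl(u(x_i),\, y_i\bigr)}_{\text{fidelity term}}
  \;+\; \lambda\, \mathrm{Lip}(u),
  \qquad
  \mathrm{Lip}(u) \;=\; \sup_{x \neq x'} \frac{\lVert u(x) - u(x') \rVert}{\lVert x - x' \rVert},
\]

where u is the network, (x_i, y_i) are the n training pairs, ℓ is the fidelity loss, and λ > 0 weights the Lipschitz penalty.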
