Global Capacity Measures for Deep ReLU Networks via Path Sampling

Classical results on the statistical complexity of linear models commonly identify the norm of the weights $\|w\|$ as a fundamental capacity measure. Generalizations of this measure to deep networks have varied, though a frequently used quantity is the product of the weight norms of the individual layers. In this work, we show that, for a large class of networks possessing a positive homogeneity property, similar bounds may be obtained instead in terms of the norm of the product of the weights. Our proof technique generalizes a recently proposed sampling argument, which allows us to demonstrate the existence of sparse approximants of positively homogeneous networks. This yields covering number bounds, which can be converted into generalization bounds for multi-class classification that are comparable to, and in certain cases improve upon, existing results in the literature. Finally, we investigate our sampling procedure empirically and obtain results consistent with our theory.
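
As an illustration of the contrast drawn above (a sketch, not the paper's exact statement; the norms and constants used in the paper may differ), consider a bias-free ReLU network $f(x) = W_L\,\sigma(W_{L-1}\cdots\sigma(W_1 x)\cdots)$ with $\sigma(z) = \max(z, 0)$. Positive homogeneity of $\sigma$ gives
\[
  \sigma(cz) = c\,\sigma(z) \quad (c \ge 0)
  \;\Longrightarrow\;
  f(cx) = c\,f(x) \quad (c \ge 0),
\]
and, for the activation-free linear chain, submultiplicativity of the operator norm gives
\[
  \underbrace{\|W_L W_{L-1} \cdots W_1\|}_{\text{norm of the product}}
  \;\le\;
  \underbrace{\prod_{i=1}^{L} \|W_i\|}_{\text{product of norms}},
\]
so, in that simple case, bounds phrased in terms of the norm of the product are never looser than those phrased in terms of the product of norms.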
