Predicting the Generalization Gap in Deep Networks with Margin Distributions

As shown in recent research, deep neural networks can perfectly fit randomly labeled data, yet achieve very poor accuracy on held-out data. This phenomenon indicates that training loss functions such as cross-entropy are not reliable indicators of generalization. This leads to the crucial question of how the generalization gap can be predicted from the training data and network parameters. In this paper, we propose such a predictive measure and conduct extensive empirical studies on how well it predicts the generalization gap. Our measure is based on the margin distribution, i.e., the distribution of distances of training points to the decision boundary. We find that it is necessary to use margin distributions measured at multiple layers of a deep network. On the CIFAR-10 and CIFAR-100 datasets, our proposed measure correlates very strongly with the generalization gap. In addition, we find several other factors to be important: normalizing margin values for scale independence, using summary statistics of the margin distribution rather than just the minimum margin (the closest distance to the decision boundary), and working in log space instead of linear space (effectively using a product of margins rather than a sum). Our measure can be easily applied to feedforward deep networks with any architecture and may point towards new training loss functions that could enable better generalization.
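To make the margin notion concrete, below is a minimal sketch of the first-order margin approximation underlying such a measure: the signed distance of an input to the decision boundary between the true class i and the runner-up class j, estimated as (f_i - f_j) divided by the norm of the gradient of (f_i - f_j) with respect to the chosen layer's activations. PyTorch is assumed; the toy model, the function name margins_at_input, and the input-layer choice are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

# Toy stand-in for a CIFAR-10 classifier (illustrative only).
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 128), nn.ReLU(),
                      nn.Linear(128, 10))

def margins_at_input(x, y):
    """First-order margin (f_i - f_j) / ||grad(f_i - f_j)||_2 per example,
    measured with respect to the input layer."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    i = y                                                            # true class
    j = logits.scatter(1, y[:, None], float('-inf')).argmax(dim=1)   # runner-up class
    diff = logits.gather(1, i[:, None]) - logits.gather(1, j[:, None])
    grad = torch.autograd.grad(diff.sum(), x)[0]        # grad of (f_i - f_j) w.r.t. x
    denom = grad.flatten(1).norm(dim=1) + 1e-12         # avoid division by zero
    return (diff.squeeze(1) / denom).detach()

# Example usage on random data of CIFAR-10 shape.
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
print(margins_at_input(x, y))
```

In the approach described above, such margins would additionally be normalized for the scale of the layer's activations, summarized by distributional statistics (e.g., quartiles) at several layers, and regressed in log space against the measured generalization gap; those steps are omitted here for brevity.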
