Measuring Generalization with Optimal Transport

Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave disparately from empirical observations. In this work, we develop margin-based generalization bounds, where the margins are normalized with optimal transport costs between independent random subsets sampled from the training distribution. In particular, the optimal transport cost can be interpreted as a generalization of variance that captures the structural properties of the learned feature space. Our bounds robustly predict the generalization error, given training data and network parameters, on large-scale datasets. Theoretically, we demonstrate that the concentration and separation of features play crucial roles in generalization, supporting empirical results in the literature. The code is available at https://github.com/chingyaoc/kV-Margin.
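
The central quantity described above is the optimal transport cost between the learned features of two independent random subsets of the training data (a k-variance-style statistic). As a minimal sketch of how such a cost could be estimated with the POT library, the snippet below averages exact 1-Wasserstein costs over a few random subset pairs; the function name `kv_estimate`, its arguments, and the Monte Carlo averaging are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import numpy as np
import ot  # POT: Python Optimal Transport


def kv_estimate(features, k, n_trials=10, seed=None):
    """Monte Carlo estimate of the expected 1-Wasserstein distance between
    the empirical measures of two disjoint size-k subsets of `features`.

    features : (n, d) array of learned representations (illustrative input).
    k        : subset size.
    """
    rng = np.random.default_rng(seed)
    n = len(features)
    costs = []
    for _ in range(n_trials):
        # Draw two disjoint random subsets of size k each.
        idx = rng.choice(n, size=2 * k, replace=False)
        X, Y = features[idx[:k]], features[idx[k:]]
        M = ot.dist(X, Y, metric="euclidean")  # pairwise ground-cost matrix
        a, b = ot.unif(k), ot.unif(k)          # uniform marginals
        costs.append(ot.emd2(a, b, M))         # exact optimal transport cost
    return float(np.mean(costs))
```

In the paper's framework this cost normalizes the classification margins: features that are tightly concentrated within classes yield a small transport cost and hence larger normalized margins.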
