1 Generalization in Classical Statistical Learning Theory

We derive upper bounds on the generalization error of learning algorithms based on their algorithmic transport cost: the expected Wasserstein distance between the output hypothesis and the output hypothesis conditioned on an input example. The bounds provide a novel approach to studying the generalization of learning algorithms from an optimal transport view and impose fewer constraints on the loss function, such as sub-Gaussianity or boundedness. We further provide several upper bounds on the algorithmic transport cost in terms of total variation distance, relative entropy (or KL-divergence), and VC dimension, thus further bridging optimal transport theory and information theory with statistical learning theory. Moreover, we study different conditions on loss functions under which the generalization error of a learning algorithm can be upper bounded by different probability metrics between distributions relating to the output hypothesis and/or the input data. Finally, under our established framework, we analyze generalization in deep learning and conclude that the generalization error in deep neural networks (DNNs) decreases exponentially to zero as the number of layers increases. Our analyses of generalization error in deep learning mainly exploit the hierarchical structure in DNNs and the contraction property of f-divergence, which may be of independent interest in analyzing other learning models with hierarchical structure.

∗ UBTECH Sydney AI Centre and the School of Information Technologies, Faculty of Engineering and Information Technologies, The University of Sydney, NSW 2006, Australia. zjin8228@uni.sydney.edu.au, tongliang.liu@sydney.edu.au, dacheng.tao@sydney.edu.au

arXiv:1811.03270v1 [stat.ML] 8 Nov 2018
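To make the abstract's central quantity concrete, the LaTeX sketch below spells out one plausible form of such a transport-cost bound. The symbols (W for the output hypothesis, Z_i for the i-th training example, W_1 for the 1-Wasserstein distance, and K for a Lipschitz-type constant of the loss) are introduced here purely for illustration; the precise assumptions and constants are those stated in the body of the paper, not this sketch.

% Illustrative sketch only (not the paper's exact statement): one way a
% transport-cost generalization bound of the kind described above can be written.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% W              : hypothesis output by the algorithm on the sample S = (Z_1, ..., Z_n)
% P_W, P_{W|Z_i} : marginal law of W and its law conditioned on the i-th example
% W_1            : 1-Wasserstein distance; K : a Lipschitz-type constant of the loss
\[
  \bigl|\,\mathbb{E}\bigl[R(W) - R_S(W)\bigr]\bigr|
  \;\le\;
  \frac{K}{n}\sum_{i=1}^{n}
  \mathbb{E}_{Z_i}\!\Bigl[\, W_1\bigl(P_W,\; P_{W \mid Z_i}\bigr) \Bigr],
\]
where $R$ and $R_S$ denote the population and empirical risks of $W$, respectively.
\end{document}

Under this reading, the "algorithmic transport cost" is the averaged Wasserstein term on the right-hand side, which the abstract says is further bounded in terms of total variation distance, KL-divergence, and VC dimension.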
