Generalization error bounds using Wasserstein distances

The generalization error of a learning algorithm characterizes the gap between the algorithm's performance on test data and its performance on training data. In recent work, Xu & Raginsky [19] showed that the generalization error can be upper-bounded using the mutual information $I(S;W)$ between the input $S$ and the output $W$ of the algorithm. In this paper, we derive upper bounds on the generalization error in terms of a certain Wasserstein distance involving the distributions of $S$ and $W$, under the assumption of a Lipschitz continuous loss function. Unlike mutual information-based bounds, these new bounds remain meaningful for deterministic learning algorithms, for which $I(S;W)$ may be infinite, and for algorithms such as stochastic gradient descent. Moreover, we show that in some natural cases these bounds are tighter than mutual information-based bounds.
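To make the comparison concrete, the two kinds of bounds discussed above can be sketched as follows. Write $S=(Z_1,\dots,Z_n)$ for the training sample drawn i.i.d. from $\mu$, $P_{W|S}$ for the algorithm, and $\ell(w,z)$ for the loss; the symbols $L_\mu$, $L_S$, $\sigma$, $L$, $\mathbb{W}_1$, and $P_{W|Z_i}$ are notation introduced here for illustration, and the Wasserstein inequality below is a representative per-sample form under a Lipschitz loss, not a verbatim statement of the paper's theorems.

The expected generalization error is
\[
\mathrm{gen}(\mu, P_{W|S}) \;=\; \mathbb{E}\bigl[L_\mu(W) - L_S(W)\bigr],
\qquad
L_\mu(w) = \mathbb{E}_{Z\sim\mu}[\ell(w,Z)],
\quad
L_S(w) = \frac{1}{n}\sum_{i=1}^{n}\ell(w,Z_i).
\]
If $\ell(w,Z)$ is $\sigma$-subgaussian under $Z\sim\mu$ for every $w$, the mutual information bound of Xu & Raginsky reads
\[
\bigl|\mathrm{gen}(\mu, P_{W|S})\bigr| \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(S;W)},
\]
which is vacuous whenever $I(S;W)=\infty$. If instead $\ell(\cdot,z)$ is $L$-Lipschitz for every $z$, a Wasserstein-type bound of the kind studied here takes the form
\[
\bigl|\mathrm{gen}(\mu, P_{W|S})\bigr| \;\le\; \frac{L}{n}\sum_{i=1}^{n}\mathbb{E}\bigl[\mathbb{W}_1\bigl(P_{W|Z_i}, P_W\bigr)\bigr],
\]
where $\mathbb{W}_1$ denotes the order-1 Wasserstein distance and the expectation is over $Z_i\sim\mu$; this quantity can stay finite even when $W$ is a deterministic function of $S$.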

[1] Maxim Raginsky et al., Information-theoretic analysis of stability and bias of learning algorithms, 2016 IEEE Information Theory Workshop (ITW), 2016.

[2] Varun Jog et al., Generalization Error Bounds for Noisy, Iterative Algorithms, 2018 IEEE International Symposium on Information Theory (ISIT), 2018.

[3] Yee Whye Teh et al., Bayesian Learning via Stochastic Gradient Langevin Dynamics, ICML, 2011.

[4] Sergio Verdú et al., Chaining Mutual Information and Tightening Generalization Bounds, NeurIPS, 2018.

[5] Furong Huang et al., Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition, COLT, 2015.

[6] Sayan Mukherjee et al., Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, Adv. Comput. Math., 2006.

[7] André Elisseeff et al., Stability and Generalization, J. Mach. Learn. Res., 2002.

[8] Shai Ben-David et al., Understanding Machine Learning: From Theory to Algorithms, 2014.

[9] Ben London, Generalization Bounds for Randomized Learning with Application to Stochastic Gradient Descent, 2016.

[10] Massimiliano Pontil et al., Stability of Randomized Learning Algorithms, J. Mach. Learn. Res., 2005.

[11] Igal Sason et al., Concentration of Measure Inequalities in Information Theory, Communications, and Coding, Found. Trends Commun. Inf. Theory, 2012.

[12] James Zou et al., Controlling Bias in Adaptive Data Analysis Using Information Theory, AISTATS, 2015.

[13] Yoram Singer et al., Train faster, generalize better: Stability of stochastic gradient descent, ICML, 2015.

[14] C. Villani, Topics in Optimal Transportation, 2003.

[15] Michael I. Jordan et al., How to Escape Saddle Points Efficiently, ICML, 2017.

[16] Kai Zheng et al., Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints, COLT, 2017.

[17] Thomas M. Cover et al., Elements of Information Theory, 2005.

[18] Raef Bassily et al., Learners that Use Little Information, ALT, 2017.

[19] Maxim Raginsky et al., Information-theoretic analysis of generalization capability of learning algorithms, NIPS, 2017.