Information-theoretic analysis of generalization capability of learning algorithms

We derive upper bounds on the generalization error of a learning algorithm in terms of the mutual information between its input and output. The bounds provide an information-theoretic understanding of generalization in learning problems, and give theoretical guidelines for striking the right balance between data fit and generalization by controlling the input-output mutual information. We propose a number of methods for this purpose, among which are algorithms that regularize the empirical risk minimization (ERM) algorithm with relative entropy or with random noise. Our work extends and leads to nontrivial improvements on the recent results of Russo and Zou.
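As a concrete instance of the kind of bound the abstract refers to (a sketch under standard assumptions, not a verbatim statement of the paper's theorems): suppose the training sample S = (Z_1, ..., Z_n) consists of n i.i.d. draws from a distribution \mu, the algorithm outputs a hypothesis W, and the loss \ell(w, Z) is \sigma-sub-Gaussian under Z \sim \mu for every w. The expected generalization gap can then be controlled by the input-output mutual information I(S; W):

\[
\bigl|\, \mathbb{E}\bigl[ L_\mu(W) - L_S(W) \bigr] \,\bigr|
\;\le\;
\sqrt{\frac{2\sigma^2}{n}\, I(S; W)},
\]

where \( L_\mu(w) = \mathbb{E}_{Z \sim \mu}[\ell(w, Z)] \) is the population risk and \( L_S(w) = \tfrac{1}{n}\sum_{i=1}^{n} \ell(w, Z_i) \) is the empirical risk. A bound of this form makes the trade-off explicit: an algorithm that fits the data well while keeping I(S; W) small (for example, by adding relative-entropy regularization or injecting noise into the output, as proposed in the paper) enjoys a correspondingly small generalization gap.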
