Average-Case Information Complexity of Learning

How many bits of information are revealed by a learning algorithm for a concept class of VC-dimension $d$? Previous work has shown that even for $d=1$ the amount of information may be unbounded (it may tend to $\infty$ with the size of the universe). Can it be that all concepts in the class require leaking a large amount of information? We show that typically this is not the case: there exists a proper learning algorithm that reveals only $O(d)$ bits of information for most concepts in the class. This result is a special case of a more general phenomenon we explore: if there is a low-information learner when the algorithm {\em knows} the underlying distribution on inputs, then there is a learner that reveals little information about an average concept {\em without knowing} the distribution on inputs.
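For concreteness, a standard way to formalize "bits of information revealed" (an assumption here, following the information-complexity literature rather than a definition quoted from this abstract) is via the mutual information between the labeled sample $S$ drawn for a target concept and the hypothesis output by the learner $A$:
$$\mathrm{IC}(A) \;=\; I\bigl(S;\, A(S)\bigr),$$
measured in bits. In this language, the statement above says that for most concepts in a class of VC-dimension $d$ there is a proper learner with $I(S; A(S)) = O(d)$, even though for worst-case concepts this quantity can grow without bound already when $d=1$.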
