Information-Theoretic Bayes Risk Lower Bounds for Realizable Models

We derive information-theoretic lower bounds on the Bayes risk and generalization error of realizable machine learning models. In particular, we employ an analysis in which the rate-distortion function of the model parameters bounds from below the mutual information between the training samples and the model parameters that is required to learn a model to within a prescribed Bayes risk. For realizable models, we show that both the rate-distortion function and the mutual information admit expressions that are convenient for analysis. For models that are (roughly) lower Lipschitz in their parameters, we bound the rate-distortion function from below, whereas for VC classes the mutual information is bounded above by d_vc log(n). When these conditions match, the Bayes risk with respect to the zero-one loss decays no faster than Ω(d_vc/n), which matches known outer bounds and minimax lower bounds up to logarithmic factors. We also consider the effect of label noise, providing lower bounds when training and/or test samples are corrupted.
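The scaling claim above can be made concrete numerically. The sketch below (illustrative only; the absolute constants `c_lo` and `c_hi` are assumptions, not values derived in the paper) compares the Ω(d_vc/n) Bayes risk lower bound against the classical O(d_vc log(n)/n) realizable-case upper bound, showing that the gap between them is exactly the logarithmic factor mentioned in the abstract:

```python
import math

def bayes_risk_lower_bound(d_vc, n, c_lo=1.0):
    """Illustrative Omega(d_vc / n) lower bound on Bayes risk.

    The constant c_lo is an assumed placeholder; the actual constant
    depends on the model class and the lower-Lipschitz condition.
    """
    return c_lo * d_vc / n

def vc_upper_bound(d_vc, n, c_hi=1.0):
    """Classical O(d_vc * log(n) / n) realizable-case upper bound.

    Again, c_hi is an assumed placeholder constant.
    """
    return c_hi * d_vc * math.log(n) / n

# With matching constants, the upper/lower ratio is precisely log(n),
# the logarithmic factor by which the bounds differ.
for n in (100, 1000, 10000):
    lo = bayes_risk_lower_bound(10, n)
    hi = vc_upper_bound(10, n)
    print(f"n={n:>6}: lower {lo:.3e}  upper {hi:.3e}  ratio {hi / lo:.2f}")
```

Under these placeholder constants, the ratio column grows like log(n), visualizing why the lower bound is tight only up to logarithmic factors.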
