How Tight Can PAC-Bayes be in the Small Data Regime?

In this paper, we investigate the question: Given a small number of datapoints, for example N = 30, how tight can PAC-Bayes and test set bounds be made? For such small datasets, test set bounds adversely affect generalisation performance by withholding data from the training procedure. In this setting, PAC-Bayes bounds are especially attractive, due to their ability to use all the data to simultaneously learn a posterior and bound its generalisation risk. We focus on the case of i.i.d. data with a bounded loss and consider the generic PAC-Bayes theorem of Germain et al. While their theorem is known to recover many existing PAC-Bayes bounds, it is unclear what the tightest bound derivable from their framework is. For a fixed learning algorithm and dataset, we show that the tightest possible bound coincides with a bound considered by Catoni; and, in the more natural case of distributions over datasets, we establish a lower bound on the best bound achievable in expectation. Interestingly, this lower bound recovers the Chernoff test set bound if the posterior is equal to the prior. Moreover, to illustrate how tight these bounds can be, we study synthetic one-dimensional classification tasks in which it is feasible to meta-learn both the prior and the form of the bound to numerically optimise for the tightest bounds possible. We find that in this simple, controlled scenario, PAC-Bayes bounds are competitive with comparable, commonly used Chernoff test set bounds. However, the sharpest test set bounds still lead to better guarantees on the generalisation error than the PAC-Bayes bounds we consider.
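
To make the abstract's starting point concrete, the generic PAC-Bayes theorem of Germain et al. is commonly stated in roughly the following form (a hedged restatement in standard notation from the PAC-Bayes literature, not necessarily the notation of this paper): for any prior P over hypotheses, any convex comparator function Δ on [0,1] × [0,1], and any δ ∈ (0,1], with probability at least 1 − δ over an i.i.d. sample S of size N, simultaneously for all posteriors Q,

```latex
\Delta\!\left(\mathbb{E}_{h\sim Q}\!\left[\hat{R}_S(h)\right],\;
              \mathbb{E}_{h\sim Q}\!\left[R(h)\right]\right)
\;\le\;
\frac{1}{N}\left[\operatorname{KL}(Q\,\|\,P)
  + \ln\!\left(\frac{1}{\delta}\,
      \mathbb{E}_{S'\sim\mathcal{D}^N}\,\mathbb{E}_{h\sim P}
      \!\left[e^{\,N\,\Delta\left(\hat{R}_{S'}(h),\,R(h)\right)}\right]\right)\right]
```

Here \hat{R}_S(h) denotes the empirical risk of hypothesis h on S and R(h) its true risk. Different choices of Δ recover different named PAC-Bayes bounds, which is why it is natural to ask which choice yields the tightest bound.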

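The Chernoff test set bound that the abstract compares against can also be made concrete. Below is a minimal sketch assuming the standard kl-inversion form of that bound (with probability at least 1 − δ over a held-out test set of size N, kl(R̂ ‖ R) ≤ ln(1/δ)/N, where kl is the binary KL divergence); the function names are ours for illustration, not the paper's.

```python
import math

def binary_kl(q, p):
    """kl(q || p) between Bernoulli(q) and Bernoulli(p), with the 0 log 0 = 0 convention."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    if q <= 0.0:
        return math.log(1.0 / (1.0 - p))
    if q >= 1.0:
        return math.log(1.0 / p)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def kl_inverse(q, c, tol=1e-10):
    """Largest p in [q, 1] with binary_kl(q, p) <= c, found by bisection
    (binary_kl(q, .) is increasing on [q, 1])."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

def chernoff_test_set_bound(num_errors, n, delta=0.05):
    """Upper bound on the true risk, holding with probability >= 1 - delta
    over the draw of a held-out test set of size n."""
    empirical_risk = num_errors / n
    return kl_inverse(empirical_risk, math.log(1.0 / delta) / n)

# 3 mistakes on a held-out test set of N = 30 points at delta = 0.05.
print(round(chernoff_test_set_bound(3, 30), 3))  # ~0.282
```

For example, 3 mistakes on a held-out set of N = 30 points at δ = 0.05 gives a risk bound of roughly 0.28, which illustrates both how loose any small-sample guarantee must be and why withholding data for a test set is costly in this regime.
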
[1] Max Welling et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[2] David Duvenaud et al. Inference Suboptimality in Variational Autoencoders, 2018, ICML.

[3] Gintare Karolina Dziugaite et al. On the role of data in PAC-Bayes bounds, 2021, AISTATS.

[4] Yee Whye Teh et al. Attentive Neural Processes, 2019, ICLR.

[5] Gintare Karolina Dziugaite et al. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, 2017, UAI.

[6] Gábor Lugosi et al. Concentration Inequalities: A Nonasymptotic Theory of Independence, 2013.

[7] Yee Whye Teh et al. Neural Processes, 2018, arXiv.

[8] Richard E. Turner et al. Meta-Learning Stationary Stochastic Process Prediction with Convolutional Neural Processes, 2020, NeurIPS.

[9] Andreas Maurer et al. A Note on the PAC Bayesian Theorem, 2004, arXiv.

[10] Christian Igel et al. A Strongly Quasiconvex PAC-Bayesian Bound, 2016, ALT.

[11] Richard E. Turner et al. The Gaussian Neural Process, 2021, arXiv.

[12] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[13] David A. McAllester. PAC-Bayesian model averaging, 1999, COLT.

[14] Shiliang Sun et al. PAC-Bayes bounds with data dependent priors, 2012, J. Mach. Learn. Res.

[15] Andreas Krause et al. PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees, 2020, ICML.

[16] C. A. Nelson et al. Learning to Learn, 2017, Encyclopedia of Machine Learning and Data Mining.

[17] David A. McAllester. PAC-Bayesian Stochastic Model Selection, 2003, Machine Learning.

[18] Alexandre Lacoste et al. PAC-Bayesian Theory Meets Bayesian Inference, 2016, NIPS.

[19] Richard E. Turner et al. Convolutional Conditional Neural Processes, 2019, ICLR.

[20] Alexander J. Smola et al. Deep Sets, 2017, arXiv:1703.06114.

[21] François Laviolette et al. PAC-Bayesian learning of linear classifiers, 2009, ICML.

[22] Gilles Blanchard et al. Occam's Hammer, 2006, COLT.

[23] François Laviolette et al. PAC-Bayesian Bounds based on the Rényi Divergence, 2016, AISTATS.

[24] Ron Meir et al. Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory, 2017, ICML.

[25] Csaba Szepesvári et al. Tighter risk certificates for neural networks, 2020, J. Mach. Learn. Res.

[26] Csaba Szepesvári et al. PAC-Bayes with Backprop, 2019, arXiv.

[27] Ryan P. Adams et al. Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach, 2018, ICLR.

[28] Christian Igel et al. Second Order PAC-Bayesian Bounds for the Weighted Majority Vote, 2020, NeurIPS.

[29] John Shawe-Taylor et al. PAC-Bayesian Inequalities for Martingales, 2011, IEEE Transactions on Information Theory.

[30] Bastian Goldlücke et al. Variational Analysis, 2014, Computer Vision: A Reference Guide.

[31] Anirudha Majumdar et al. PAC-Bayes Control: Synthesizing Controllers that Provably Generalize to Novel Environments, 2018, CoRL.

[32] Michael I. Jordan et al. Advances in Neural Information Processing Systems 30, 1995.

[33] Thomas Brox et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[34] O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, 2007, arXiv:0712.0248.

[35] Amaury Habrard et al. A General Framework for the Derandomization of PAC-Bayesian Bounds, 2021, arXiv.

[36] Jie Lu et al. PAC-Bayes Bounds for Meta-learning with Data-Dependent Prior, 2021, arXiv.

[37] Pierre Alquier et al. Simpler PAC-Bayesian bounds for hostile data, 2016, Machine Learning.

[38] Anirudha Majumdar et al. PAC-BUS: Meta-Learning Bounds via PAC-Bayes and Uniform Stability, 2021, arXiv.

[39] Yevgeny Seldin et al. PAC-Bayes-Empirical-Bernstein Inequality, 2013, NIPS.

[40] J. Langford. Tutorial on Practical Prediction Theory for Classification, 2005, J. Mach. Learn. Res.

[41] Ilja Kuzborskij et al. PAC-Bayes Analysis Beyond the Usual Bounds, 2020, NeurIPS.

[42] François Laviolette et al. Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm, 2015, J. Mach. Learn. Res.

[43] Yee Whye Teh et al. Conditional Neural Processes, 2018, ICML.

[44] David A. McAllester. A PAC-Bayesian Tutorial with A Dropout Bound, 2013, arXiv.