PAC-Bayes Analysis Beyond the Usual Bounds

We focus on a stochastic learning model where the learner observes a finite set of training examples and the output of the learning process is a data-dependent distribution over a space of hypotheses. The learned data-dependent distribution is then used to make randomized predictions, and the high-level theme addressed here is guaranteeing the quality of predictions on examples that were not seen during training, i.e., generalization. In this setting the unknown quantity of interest is the expected risk of the data-dependent randomized predictor, for which upper bounds can be derived via a PAC-Bayes analysis, leading to PAC-Bayes bounds. Specifically, we present a basic PAC-Bayes inequality for stochastic kernels, from which one may derive extensions of various known PAC-Bayes bounds as well as novel bounds. We clarify the role of the requirement of fixed `data-free' priors and illustrate the use of data-dependent priors. We also present a simple bound that is valid for loss functions with unbounded range. Our analysis shows that these two requirements (a fixed data-free prior and a bounded loss) were used only to upper-bound an exponential moment term, while the basic PAC-Bayes inequality remains valid with those restrictions removed.
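For orientation, a standard change-of-measure argument shows where such an exponential moment term arises; this is a generic sketch rather than the paper's exact statement, and the symbols $f$, $Q^0$, $S$ and $\delta$ are introduced here only for illustration. For a prior $Q^0$ over hypotheses and any measurable function $f$, the Donsker--Varadhan variational formula gives, simultaneously for all posteriors $Q$,
\[
  \mathbb{E}_{h \sim Q}\big[f(h,S)\big] \;\le\; \mathrm{KL}(Q \,\|\, Q^0) \;+\; \ln \mathbb{E}_{h \sim Q^0}\big[e^{f(h,S)}\big],
\]
and if $Q^0$ does not depend on the sample $S$, Markov's inequality yields, with probability at least $1-\delta$ over the draw of $S$,
\[
  \ln \mathbb{E}_{h \sim Q^0}\big[e^{f(h,S)}\big] \;\le\; \ln\!\Big(\tfrac{1}{\delta}\, \mathbb{E}_{S}\, \mathbb{E}_{h \sim Q^0}\big[e^{f(h,S)}\big]\Big).
\]
Only this second step uses a data-free prior, and a bounded loss is what typically makes the expected exponential moment $\mathbb{E}_{S}\,\mathbb{E}_{h \sim Q^0}\big[e^{f(h,S)}\big]$ controllable (e.g. via Hoeffding-type arguments); the first inequality holds without either restriction.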
