Revisiting generalization for deep learning: PAC-Bayes, flat minima, and generative models

In this work, we construct generalization bounds to understand existing learning algorithms and to propose new ones. Generalization bounds relate empirical performance to future expected performance. The tightness of these bounds varies widely: it depends on the complexity of the learning task and the amount of data available, but also on how much information the bounds take into account. We are particularly concerned with data- and algorithm-dependent bounds that are quantitatively nonvacuous. We begin with an analysis of stochastic gradient descent (SGD) in supervised learning. By formalizing the notion of flat minima using PAC-Bayes generalization bounds, we obtain nonvacuous generalization bounds for stochastic classifiers based on SGD solutions. Despite strong empirical performance in many settings, SGD rapidly overfits in others. By combining nonvacuous generalization bounds and structural risk minimization, we arrive at an algorithm that trades off accuracy against generalization guarantees. We also study generalization in the context of unsupervised learning. We propose to use a two-sample test statistic for training neural network generator models, and we bound the gap between the population value of the statistic and its empirical estimate.
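For concreteness, the following are illustrative statements of the two quantities underlying this work; the notation (prior $P$, posterior $Q$, sample size $m$, confidence $\delta$, kernel $k$) is introduced here for exposition, and the exact constants differ across versions in the literature. A standard PAC-Bayes bound (in the McAllester–Maurer form) states that, with probability at least $1-\delta$ over an i.i.d. sample $S$ of size $m$, simultaneously for all posteriors $Q$,

$$ \mathrm{kl}\!\big(\hat{L}_S(Q)\,\big\|\,L(Q)\big) \;\le\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{m}}{\delta}}{m}, $$

where $\hat{L}_S(Q)$ and $L(Q)$ are the empirical and expected risks of the stochastic classifier $Q$ and $\mathrm{kl}(\cdot\,\|\,\cdot)$ denotes the KL divergence between Bernoulli distributions. A posterior $Q$ with small $\mathrm{KL}(Q\,\|\,P)$, obtained by spreading mass broadly around an SGD solution without increasing the empirical risk, is one way to make the notion of a flat minimum precise. For the generative-modelling part, a representative two-sample statistic is of the kernel maximum mean discrepancy type,

$$ \mathrm{MMD}^2(P, Q_\theta) \;=\; \mathbb{E}_{x,x'\sim P}\,k(x,x') \;-\; 2\,\mathbb{E}_{x\sim P,\,y\sim Q_\theta}\,k(x,y) \;+\; \mathbb{E}_{y,y'\sim Q_\theta}\,k(y,y'), $$

whose empirical estimate is minimized over the generator parameters $\theta$, while a concentration argument bounds its deviation from the population value.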
