PAC-Bayesian Generalization Bounds for Multilayer Perceptrons

We study PAC-Bayesian generalization bounds for multilayer perceptrons (MLPs) trained with the cross-entropy loss. First, we give probabilistic interpretations of MLPs in two respects: (i) an MLP formulates a family of Gibbs distributions, and (ii) minimizing the cross-entropy loss of an MLP is equivalent to Bayesian variational inference. Together, these establish a solid probabilistic foundation for studying PAC-Bayesian bounds on MLPs. Furthermore, based on the Evidence Lower Bound (ELBO), we prove that MLPs with the cross-entropy loss inherently guarantee PAC-Bayesian generalization bounds, and that minimizing these bounds is equivalent to maximizing the ELBO. Finally, we validate the proposed PAC-Bayesian generalization bound on benchmark datasets.
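For reference, the three ingredients the abstract invokes have standard forms; the following is a sketch in conventional notation (Gibbs posterior, ELBO, and a relaxed McAllester/Maurer-style bound), not the paper's exact theorem statement. Here $S$ is a training sample of size $m$, $P$ a prior and $Q$ a posterior over parameters $\theta$, $L$ the true risk and $\hat{L}_S$ the empirical risk.

$$
p(y \mid x; \theta) = \frac{\exp\bigl(f_y(x;\theta)\bigr)}{\sum_{y'} \exp\bigl(f_{y'}(x;\theta)\bigr)}
\qquad \text{(softmax output as a Gibbs distribution with energy } -f_y\text{)}
$$

$$
\log p(S) \;\ge\; \mathbb{E}_{\theta \sim Q}\bigl[\log p(S \mid \theta)\bigr] - \mathrm{KL}(Q \,\|\, P) \;=\; \mathrm{ELBO}(Q)
$$

$$
\mathbb{E}_{\theta \sim Q}\bigl[L(\theta)\bigr] \;\le\; \mathbb{E}_{\theta \sim Q}\bigl[\hat{L}_S(\theta)\bigr] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}}
\qquad \text{with probability at least } 1-\delta
$$

The same $\mathrm{KL}(Q \,\|\, P)$ term that the ELBO trades against the expected log-likelihood also controls the complexity penalty in the PAC-Bayesian bound, which is the sense in which maximizing the ELBO and minimizing the bound coincide.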
