Wide stochastic networks: Gaussian limit and PAC-Bayesian training

The infinite-width limit substantially simplifies the analytical study of overparameterized neural networks: with a suitable random initialization, a sufficiently wide network is well approximated by a Gaussian process, both before and during training. In the present work, we establish an analogous result for a simple stochastic architecture whose parameters are random variables. Because the output distribution can be evaluated explicitly, it admits a PAC-Bayesian training procedure that directly optimizes the generalization bound. For a large but finite-width network, we show empirically on MNIST that this training approach can outperform standard PAC-Bayesian methods.
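
For reference, a standard bound of the kind such bound-minimization procedures start from is the PAC-Bayes-kl inequality of Maurer (2004); the specific bound and relaxation optimized in this work are not reproduced here, so the following is only a generic sketch with standard notation. For an i.i.d. sample S of size n >= 8, a prior P over parameters fixed before seeing S, and any delta in (0,1), with probability at least 1 - delta, simultaneously for all posteriors Q,

    \mathrm{kl}\big( \hat{L}_S(Q) \,\big\|\, L(Q) \big) \;\le\; \frac{ \mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta} }{ n },

where \hat{L}_S(Q) and L(Q) denote the expected empirical and true risks under Q, and \mathrm{kl}(q \,\|\, p) = q \ln\frac{q}{p} + (1-q) \ln\frac{1-q}{1-p}. Pinsker's inequality \mathrm{kl}(q \,\|\, p) \ge 2 (p - q)^2 gives the differentiable relaxation

    L(Q) \;\le\; \hat{L}_S(Q) + \sqrt{ \frac{ \mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta} }{ 2n } },

whose right-hand side can be minimized directly over the posterior Q. In the setting described above, the explicit (Gaussian, in the wide limit) form of the output distribution is what makes such an objective tractable to evaluate.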
