Expectation propagation: a probabilistic view of Deep Feed Forward Networks

We present a statistical mechanics model of deep feed-forward neural networks (FFN). Our energy-based approach naturally explains several known results and heuristics, providing a solid theoretical framework and new instruments for a systematic development of FFN. We infer that FFN can be understood as performing three basic steps: encoding, representation validation and propagation. We obtain a set of natural activations -- such as sigmoid, $\tanh$ and ReLU -- together with a state-of-the-art one, recently obtained by Ramachandran et al. (arXiv:1710.05941) using an extensive search algorithm. We term this activation ESP (Expected Signal Propagation), explain its probabilistic meaning, and study the eigenvalue spectrum of the associated Hessian on classification tasks. We find that ESP allows for faster training and more consistent performance across a wide range of network architectures.
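The abstract does not spell out the functional form of ESP. As a rough illustration only, the sketch below assumes ESP coincides with the activation reported by Ramachandran et al. (arXiv:1710.05941), $f(x) = x\,\sigma(x)$, read here as the expected propagated signal: the pre-activation $x$ weighted by the sigmoid probability $\sigma(x)$ that the signal is passed on. The helper names are ours, not from the paper.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def esp(x):
    """Candidate ESP (Expected Signal Propagation) activation.

    Assumption: ESP has the form found by Ramachandran et al.
    (arXiv:1710.05941), f(x) = x * sigma(x), i.e. the pre-activation
    times the probability that it propagates to the next layer.
    """
    return x * sigmoid(x)

if __name__ == "__main__":
    # Compare ESP with the other natural activations named in the abstract.
    x = np.linspace(-4.0, 4.0, 9)
    for name, f in [("sigmoid", sigmoid),
                    ("tanh", np.tanh),
                    ("ReLU", lambda z: np.maximum(0.0, z)),
                    ("ESP", esp)]:
        print(f"{name:>7}: {np.round(f(x), 3)}")
```

Unlike ReLU, this form is smooth and slightly non-monotonic for small negative inputs, which is consistent with the probabilistic reading given in the lead-in.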

[1] Zohar Ringel et al., Mutual information, neural networks and the renormalization group, 2017, arXiv.

[2] Naftali Tishby et al., Opening the Black Box of Deep Neural Networks via Information, 2017, arXiv.

[3] David J. Schwab et al., An exact mapping between the Variational Renormalization Group and Deep Learning, 2014, arXiv.

[4] M. Mézard et al., The Bethe lattice spin glass revisited, 2000, cond-mat/0009418.

[5] M. Cassandro et al., Critical point behaviour and probability theory, 1978.

[6] Samy Bengio et al., Understanding deep learning requires rethinking generalization, 2016, ICLR.

[7] Jeffrey Pennington et al., Geometry of Neural Network Loss Surfaces via Random Matrix Theory, 2017, ICML.

[8] Yann LeCun et al., The Loss Surfaces of Multilayer Networks, 2014, arXiv.

[9] Irene Giardina et al., Random Fields and Spin Glasses: A Field Theory Approach, 2010.

[10] R. Zecchina et al., Inverse statistical problems: from the inverse Ising problem to data science, 2017, arXiv:1702.01522.

[11] E. T. Jaynes, Probability Theory: The Logic of Science, 2003, Cambridge University Press.

[12] Erio Tosatti, Statistical Mechanics and Applications in Condensed Matter, 2016.

[13] David J. C. MacKay, Information Theory, Inference, and Learning Algorithms, 2003, Cambridge University Press.

[14] Razvan Pascanu et al., Sharp Minima Can Generalize For Deep Nets, 2017, ICML.

[15] Anders Krogh et al., Introduction to the Theory of Neural Computation, 1994, The Advanced Book Program.

[16] Shang-keng Ma, Modern Theory of Critical Phenomena, 1976.

[17] Geoffrey E. Hinton et al., ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[18] Surya Ganguli et al., On the Expressive Power of Deep Neural Networks, 2016, ICML.

[19] T. Toffoli, Physics and computation, 1982.

[20] Christian Van den Broeck et al., Statistical Mechanics of Learning, 2001.

[21] Hwee Kuan Lee et al., Distribution Regression Network, 2018, arXiv.

[22] Naftali Tishby et al., Deep learning and the information bottleneck principle, 2015, IEEE Information Theory Workshop (ITW).

[23] Lorenzo Rosasco et al., Why and when can deep -- but not shallow -- networks avoid the curse of dimensionality: A review, 2016, International Journal of Automation and Computing.

[24] Fabrice Mortessagne et al., Equilibrium and Non-Equilibrium Statistical Thermodynamics, 2004.

[25] Surya Ganguli et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, 2014, NIPS.

[26] Quoc V. Le et al., Searching for Activation Functions, 2018, arXiv.

[27] Stefano Soatto et al., Entropy-SGD: biasing gradient descent into wide valleys, 2016, ICLR.

[28] Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006, Springer.

[29] Arnaud Doucet et al., On the Selection of Initialization and Activation Function for Deep Neural Networks, 2018, arXiv.

[30] Yann LeCun et al., Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond, 2016, arXiv:1611.07476.

[31] J. Cardy et al., Quantum quenches in extended systems, 2007, arXiv:0704.1880.

[32] Daniel J. Amit, Modeling Brain Function: The World of Attractor Neural Networks, 1st Edition, 1989.

[33] Jeffrey Pennington et al., Nonlinear random matrix theory for deep learning, 2017, NIPS.

[34] Max Tegmark et al., Why Does Deep and Cheap Learning Work So Well?, 2016, Journal of Statistical Physics.

[35] Surya Ganguli et al., Exponential expressivity in deep neural networks through transient chaos, 2016, NIPS.

[36] Jimmy Ba et al., Adam: A Method for Stochastic Optimization, 2014, ICLR.

[37] Nitish Srivastava et al., Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.