Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights

Multilayer Neural Networks (MNNs) are commonly trained using gradient descent-based methods, such as BackPropagation (BP). Inference in probabilistic graphical models is often done using variational Bayes methods, such as Expectation Propagation (EP). We show how an EP-based approach can also be used to train deterministic MNNs. Specifically, we approximate the posterior of the weights given the data using a "mean-field" factorized distribution, in an online setting. Using online EP and the central limit theorem, we derive an analytical approximation to the Bayes update of this posterior, as well as the resulting Bayes estimates of the weights and outputs. Despite its different origin, the resulting algorithm, Expectation BackPropagation (EBP), is very similar to BP in form and efficiency. However, it has several additional advantages: (1) Training is parameter-free, given the initial conditions (prior) and the MNN architecture. This is useful for large-scale problems, where parameter tuning is a major challenge. (2) The weights can be restricted to have discrete values. This is especially useful for implementing trained MNNs in precision-limited hardware chips, which can improve their speed and energy efficiency by several orders of magnitude. We test the EBP algorithm numerically on eight binary text classification tasks. In all tasks, EBP outperforms (1) standard BP with the optimal constant learning rate and (2) the previously reported state of the art. Interestingly, EBP-trained MNNs with binary weights usually perform better than MNNs with continuous (real) weights, provided the MNN output is averaged over the inferred posterior.
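To make the kind of update described above concrete, below is a minimal, illustrative sketch of an online mean-field Bayesian update for a single neuron with weights restricted to {-1, +1}, assuming a probit-style likelihood and the CLT-based Gaussian approximation of the pre-activation. The function name, the tanh parameterization of the weight marginals, and the toy teacher in the usage snippet are assumptions made for illustration; this conveys the general flavor of such an update, not the full multilayer EBP algorithm.

```python
import numpy as np
from scipy.stats import norm


def mean_field_binary_update(h, x, t, eps=1e-12):
    """One online Bayesian update of a factorized posterior over binary weights.

    h : natural parameters of the weight marginals; the posterior mean of w_i is tanh(h_i)
    x : input vector
    t : target label in {-1, +1}
    """
    m = np.tanh(h)                        # posterior means of the weights
    v = 1.0 - m ** 2                      # posterior variances (weights in {-1, +1})
    mu = m @ x                            # CLT: mean of the pre-activation w.x
    sigma = np.sqrt(v @ (x ** 2) + eps)   # CLT: std of the pre-activation
    z = t * mu / sigma
    # d/d(mu) of log P(t | x) = log Phi(t * mu / sigma)
    g = t * norm.pdf(z) / (sigma * norm.cdf(z) + eps)
    return h + g * x                      # no learning rate: the update is parameter-free


# Toy usage: learn a hypothetical binary teacher online, then clip to binary weights.
rng = np.random.default_rng(0)
d = 20
w_teacher = rng.choice([-1.0, 1.0], size=d)
h = np.zeros(d)                           # uniform prior over {-1, +1}
for _ in range(500):
    x = rng.standard_normal(d)
    t = 1.0 if w_teacher @ x >= 0 else -1.0
    h = mean_field_binary_update(h, x, t)
w_binary = np.sign(h)                     # "clipped" binary weight estimate
```

Because the update direction comes from an approximate Bayes update of the posterior rather than from a gradient step, no learning rate appears, mirroring the parameter-free property claimed in the abstract. Extending this idea to multiple hidden layers is where the BP-like backward pass of EBP enters, which this single-neuron sketch does not attempt.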
