Practical Variational Inference for Neural Networks

Variational methods have previously been explored as a tractable approximation to Bayesian inference for neural networks. However, the approaches proposed so far have been applicable only to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, a minimum description length loss function) that can be applied to most neural networks. Along the way, it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and improve generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus.
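In outline, a method of this kind keeps a diagonal Gaussian posterior over each network weight and minimises a variational free energy: the expected training loss under that posterior plus the KL divergence from the posterior to a prior. The sketch below is a minimal illustration of that recipe on a toy linear-regression problem rather than a full network; the data, hyperparameters, and the single-sample pathwise gradient estimator are all illustrative assumptions chosen for brevity, not the paper's exact setup or derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative); the third input is irrelevant
# (true weight 0), so its posterior signal-to-noise should end up low.
X = rng.normal(size=(64, 3))
true_w = np.array([1.5, -2.0, 0.0])
y = X @ true_w + 0.3 * rng.normal(size=64)

noise_var = 0.1          # assumed observation-noise variance
prior_var = 1.0          # prior p(w) = N(0, prior_var) on each weight
mu = np.zeros(3)         # variational posterior means
rho = np.full(3, -3.0)   # sigma = softplus(rho) keeps posterior stds positive
lr = 1e-4

for step in range(5000):
    sigma = np.log1p(np.exp(rho))
    eps = rng.normal(size=3)
    w = mu + sigma * eps                    # one Monte Carlo weight sample

    # Gradient of the negative log-likelihood w.r.t. the sampled weights.
    resid = X @ w - y
    nll_grad_w = X.T @ resid / noise_var

    # Gradients of KL(q || p) between diagonal Gaussians, per weight.
    kl_grad_mu = mu / prior_var
    kl_grad_sigma = sigma / prior_var - 1.0 / sigma

    # Pathwise gradients of the free energy (expected loss + KL),
    # with the chain rule through softplus: d sigma / d rho = sigmoid(rho).
    grad_mu = nll_grad_w + kl_grad_mu
    grad_rho = (nll_grad_w * eps + kl_grad_sigma) / (1.0 + np.exp(-rho))

    mu -= lr * grad_mu
    rho -= lr * grad_rho

sigma = np.log1p(np.exp(rho))
print("posterior means :", mu.round(2))                  # should approach true_w
print("signal-to-noise :", np.abs(mu / sigma).round(1))  # low value => prunable weight
```

The final printout hints at the pruning heuristic mentioned in the abstract: a weight whose posterior mean is small relative to its posterior standard deviation carries little information and is a natural candidate for removal.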
