A Practical Bayesian Framework for Backprop Networks

A quantitative and practical Bayesian framework is described for learning of mappings in feedforward networks. The framework makes possible: (1) objective comparisons between solutions using alternative network architectures; (2) objective stopping rules for deletion of weights; (3) objective choice of magnitude and type of weight decay terms or additive regularisers (for penalising large weights, etc.); (4) a measure of the effective number of well-determined parameters in a model; (5) quantified estimates of the error bars on network parameters and on network output; (6) objective comparisons with alternative learning and interpolation models such as splines and radial basis functions. The Bayesian 'evidence' automatically embodies 'Occam's razor', penalising over-flexible and over-complex architectures. The Bayesian approach helps detect poor underlying assumptions in learning models. For learning models well-matched to a problem, a good correlation between generalisation ability and the Bayesian evidence is obtained.
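To make items (3)-(5) concrete, the sketch below shows how the standard Gaussian (evidence) approximation quantities can be computed for a quadratic error model: the log evidence for given regulariser strength alpha and noise level beta, and the effective number of well-determined parameters gamma. This is a minimal illustration assuming the usual formulas of the evidence framework; the function name, arguments, and use of a precomputed data-error Hessian are illustrative, not part of the paper.

```python
# A minimal sketch of the Gaussian "evidence" approximation, assuming the
# standard formulas: M(w) = alpha*E_W + beta*E_D, with Hessian A = beta*H + alpha*I
# at the most probable weights. All names here are illustrative.
import numpy as np

def evidence_quantities(hess_ED, E_D, E_W, alpha, beta, N):
    """Return (log_evidence, gamma) given the Hessian H of the data error E_D
    at the most probable weights, E_D itself, E_W = 0.5*||w||^2, the
    hyperparameters alpha and beta, and the number of data points N."""
    k = hess_ED.shape[0]                      # number of weights
    lam = beta * np.linalg.eigvalsh(hess_ED)  # eigenvalues of beta*H
    gamma = np.sum(lam / (lam + alpha))       # effective number of parameters
    log_det_A = np.sum(np.log(lam + alpha))   # log det(beta*H + alpha*I)
    log_ev = (-alpha * E_W - beta * E_D
              - 0.5 * log_det_A
              + 0.5 * k * np.log(alpha)
              + 0.5 * N * np.log(beta)
              - 0.5 * N * np.log(2 * np.pi))
    return log_ev, gamma
```

Under the same assumptions, gamma also drives the usual fixed-point re-estimates of the hyperparameters, alpha = gamma / (2 E_W) and beta = (N - gamma) / (2 E_D), which is how the framework chooses the magnitude of the weight-decay term without a validation set.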
