Bayesian nonlinear model selection and neural networks: a conjugate prior approach

To select the neural-network architecture with the best predictive performance among several candidate networks, we propose a general Bayesian model comparison procedure for nonlinear regression, based on the maximization of an expected utility criterion. This criterion selects the model under which the training set achieves the highest internal consistency, as measured by the predictive probability distribution of each model. The density of this distribution is the posterior predictive density of the model; it is approximated asymptotically from the assumed Gaussian likelihood of the data set and the corresponding conjugate prior density of the parameters. This conjugate prior allows the parameter posterior and posterior predictive densities to be computed analytically, in an empirical-Bayes-like approach. The resulting selection procedure can compare general nonlinear regression models, feedforward neural networks in particular, and is not restricted to nested models as the usual asymptotic comparison tests are.
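As a rough illustration of the conjugate-prior machinery, the sketch below compares candidate models by their closed-form log posterior predictive (evidence) score under a Gaussian likelihood with a Normal-Inverse-Gamma conjugate prior. All names, prior settings, and the polynomial toy models standing in for network architectures are illustrative assumptions, not the authors' implementation; the paper reaches this conjugate form for nonlinear networks only through an asymptotic Gaussian approximation of the likelihood, which is not reproduced here.

```python
# Minimal sketch (assumed, not the paper's code): conjugate-prior Bayesian
# model comparison for Gaussian regression.  Each candidate model is a
# linear(ized) design matrix; the selected model maximizes the log evidence.
import numpy as np
from scipy.special import gammaln


def log_evidence(X, y, mu0=None, V0=None, a0=1e-2, b0=1e-2):
    """Log marginal density of y under the Normal-Inverse-Gamma prior
    beta | s2 ~ N(mu0, s2 * V0),  s2 ~ InvGamma(a0, b0).
    The default hyperparameters are illustrative, weakly informative choices."""
    n, p = X.shape
    if mu0 is None:
        mu0 = np.zeros(p)
    if V0 is None:
        V0 = np.eye(p) * 1e2                      # weak prior on the weights
    V0_inv = np.linalg.inv(V0)
    Vn_inv = V0_inv + X.T @ X                     # posterior precision
    Vn = np.linalg.inv(Vn_inv)
    mun = Vn @ (V0_inv @ mu0 + X.T @ y)           # posterior mean
    an = a0 + 0.5 * n
    bn = b0 + 0.5 * (y @ y + mu0 @ V0_inv @ mu0 - mun @ Vn_inv @ mun)
    # Closed-form log evidence of the conjugate Normal-Inverse-Gamma model.
    _, logdet_V0 = np.linalg.slogdet(V0)
    _, logdet_Vn = np.linalg.slogdet(Vn)
    return (-0.5 * n * np.log(2 * np.pi)
            + 0.5 * (logdet_Vn - logdet_V0)
            + a0 * np.log(b0) - an * np.log(bn)
            + gammaln(an) - gammaln(a0))


# Toy comparison: polynomial "architectures" of increasing complexity stand in
# for candidate networks on a noisy sine regression problem.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(x.size)

candidates = {f"degree {d}": np.vander(x, d + 1, increasing=True)
              for d in range(1, 7)}
scores = {name: log_evidence(X, y) for name, X in candidates.items()}
best = max(scores, key=scores.get)
print("selected model:", best)
```

Under these assumptions, the candidate whose design matrix yields the largest log evidence is retained, mirroring the expected-utility selection described in the abstract.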
