Model Selection for Neural Network Classification

Out-of-sample classification rates can often be improved by applying model selection when fitting a model to the training data. Correlated predictors or an overly high-dimensional model can lead to overfitting, which in turn degrades out-of-sample performance. I discuss methodology based on the Bayesian Information Criterion (BIC) of Schwarz (1978) that can search large model spaces and find appropriate models that reduce the danger of overfitting. The methodology can be interpreted either as a frequentist method with a Bayesian inspiration or as a Bayesian method based on noninformative priors.
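To make the selection criterion concrete: BIC scores a fitted model as -2 log L + k ln(n), where L is the maximized likelihood, k the number of free parameters, and n the number of observations; the model with the smallest BIC is preferred. The sketch below is illustrative only (synthetic data, polynomial regression as a stand-in for comparing network sizes; all names and data are my own, not from the paper):

```python
import numpy as np

def bic(log_likelihood, n_params, n_obs):
    # BIC (Schwarz 1978): -2*logL + k*ln(n); smaller is better.
    return -2.0 * log_likelihood + n_params * np.log(n_obs)

# Synthetic example: compare fits of increasing complexity,
# analogous to comparing networks with more hidden units.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.3, n)

scores = {}
for degree in range(1, 6):
    X = np.vander(x, degree + 1)              # design matrix for this model
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / n                # MLE of the noise variance
    # Gaussian log-likelihood evaluated at the MLE.
    logL = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = (degree + 1) + 1                      # coefficients plus noise variance
    scores[degree] = bic(logL, k, n)

best = min(scores, key=scores.get)
print(f"model chosen by BIC: degree {best}")
```

The ln(n) penalty grows with sample size, so BIC punishes extra parameters more heavily than AIC (reference [18]) and tends to favor more parsimonious models, which is the behavior the paper exploits to guard against overfitting.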

References

[1] G. Schwarz. Estimating the Dimension of a Model, 1978.

[2] H. K. H. Lee et al. Model selection and model averaging for neural networks, 1998.

[3] Y. Bengio et al. Pattern Recognition and Neural Networks, 1995.

[4] P. Müller et al. Feedforward Neural Networks for Nonparametric Regression, 1998.

[5] A. Raftery. Approximate Bayes factors and accounting for model uncertainty in generalised linear models, 1996.

[6] S. Amari et al. Network information criterion: determining the number of hidden units for an artificial neural network model, IEEE Trans. Neural Networks, 1994.

[7] E. E. Leamer. Specification Searches: Ad Hoc Inference with Nonexperimental Data, 1980.

[8] D. Madigan et al. Bayesian Model Averaging for Linear Regression Models, 1997.

[9] G. E. Hinton et al. Bayesian Learning for Neural Networks, 1995.

[10] K. Funahashi et al. On the approximate realization of continuous mappings by neural networks, Neural Networks, 1989.

[11] G. Cybenko et al. Approximation by superpositions of a sigmoidal function, Math. Control. Signals Syst., 1989.

[12] C. Andrieu et al. Robust Full Bayesian Learning for Neural Networks, 1999.

[13] L. Wasserman et al. A Reference Bayesian Test for Nested Hypotheses and its Relationship to the Schwarz Criterion, 1995.

[14] K. Hornik et al. Multilayer feedforward networks are universal approximators, Neural Networks, 1989.

[15] H. K. H. Lee. Consistency of posterior distributions for neural networks, Neural Networks, 2000.

[16] H. Oh et al. Neural Networks for Pattern Recognition, Adv. Comput., 1993.

[17] E. R. Ziegel et al. Generalized Linear Models, Technometrics, 2002.

[18] H. Akaike. A new look at the statistical model identification, 1974.

[19] J. S. Bridle. Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition, NATO Neurocomputing, 1989.

[20] M. Stone. Cross-Validatory Choice and Assessment of Statistical Predictions, 1976.