What Size Neural Network Gives Optimal Generalization? Convergence Properties of Backpropagation

One of the most important aspects of any machine learning paradigm is how it scales with problem size and complexity. Using a task with known optimal training error and a pre-specified maximum number of training updates, we investigate the convergence of the backpropagation algorithm with respect to (a) the complexity of the required function approximation, (b) the size of the network relative to the size required for an optimal solution, and (c) the degree of noise in the training data. In general, for (a) the solution found is worse when the function to be approximated is more complex, for (b) oversized networks can yield lower training and generalization error in certain cases, and for (c) committee or ensemble techniques become more beneficial as the level of noise in the training data increases. In none of the experiments we performed did we obtain the optimal solution. We further support the observation that larger networks can produce better training and generalization error with a face recognition example, in which a network with many more parameters than training points generalizes better than smaller networks.
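As a rough illustration of the kind of experiment described above (not the paper's exact setup), the sketch below fits networks of different hidden-layer sizes to noisy samples of a known target function under a fixed training budget, compares training and test error, and averages a small committee of independently initialized networks. The target function, noise level, hidden sizes, and the use of scikit-learn's MLPRegressor are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact experimental setup):
# compare undersized vs. oversized networks on a noisy approximation task with a fixed
# training budget, and evaluate a simple committee (ensemble) average.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def target(x):
    # Known target function, so the noise-free optimal training error is known.
    return np.sin(3 * x) + 0.5 * np.cos(7 * x)

# Noisy training data and clean test data for estimating generalization error.
noise_std = 0.2                              # assumed noise level
x_train = rng.uniform(-1, 1, (200, 1))
y_train = target(x_train).ravel() + rng.normal(0, noise_std, 200)
x_test = np.linspace(-1, 1, 1000).reshape(-1, 1)
y_test = target(x_test).ravel()

for hidden in (5, 20, 100):                  # undersized through oversized networks
    net = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=2000, random_state=0)
    net.fit(x_train, y_train)
    print(hidden,
          mean_squared_error(y_train, net.predict(x_train)),   # training error
          mean_squared_error(y_test, net.predict(x_test)))     # generalization error

# Committee: average the predictions of several independently initialized networks.
members = [MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000,
                        random_state=s).fit(x_train, y_train) for s in range(5)]
committee_pred = np.mean([m.predict(x_test) for m in members], axis=0)
print("committee", mean_squared_error(y_test, committee_pred))
```

Because the clean target is known, the test error against it serves as a direct proxy for generalization error, mirroring the paper's use of a task with a known optimal training error.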
