Lessons in Neural Network Training: Overfitting May Be Harder than Expected

Neural networks have become popular machine learning models for many reasons. Two of the most important properties of any machine learning model are how well it generalizes to unseen data and how well it scales with problem complexity. Using a controlled task with a known optimal training error, we investigate the convergence of the backpropagation (BP) algorithm and find that the optimal solution is typically not found. Furthermore, we observe that networks larger than might be expected can yield lower training and generalization error. This result is supported by another real-world example. We further investigate training behavior by analyzing the weights in trained networks (excess degrees of freedom are seen to do little harm and to aid convergence) and by contrasting the interpolation characteristics of multi-layer perceptron (MLP) neural networks and polynomial models (their overfitting behavior is very different: the MLP is often biased towards smoother solutions). Finally, we analyze relevant theory, outlining the reasons for significant practical differences. These results call into question common beliefs about neural network training regarding convergence and optimal network size, suggest alternative guidelines for practical use (less fear of excess degrees of freedom), and help to direct future work (e.g., methods for creating more parsimonious solutions, the importance of the MLP/BP bias, and the possibly worse performance of "improved" training algorithms).
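To make the controlled setup concrete, here is a minimal sketch of a student-teacher experiment in the spirit of the task described above: training targets are generated by a fixed "teacher" MLP, so a student with at least the teacher's number of hidden units can in principle reach zero training error, and students of several sizes are then trained with plain backpropagation. The architecture, learning rate, sample size, and training schedule below are illustrative assumptions, not the paper's exact settings.

```python
# Minimal student-teacher sketch (illustrative hyperparameters, not the
# paper's exact experimental settings).
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(n_in, n_hidden, n_out):
    """Random single-hidden-layer MLP with tanh hidden units."""
    return {
        "W1": rng.normal(0, 1.0, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 1.0, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(net, X):
    H = np.tanh(X @ net["W1"] + net["b1"])  # hidden activations
    return H @ net["W2"] + net["b2"], H     # linear output layer

def train_bp(net, X, Y, lr=0.01, epochs=5000):
    """Plain batch gradient descent on mean squared error (vanilla BP)."""
    n = len(X)
    for _ in range(epochs):
        out, H = forward(net, X)
        err = out - Y                           # error signal (constant factor folded into lr)
        gW2 = H.T @ err / n
        gb2 = err.mean(axis=0)
        dH = (err @ net["W2"].T) * (1 - H**2)   # backprop through tanh
        gW1 = X.T @ dH / n
        gb1 = dH.mean(axis=0)
        net["W1"] -= lr * gW1; net["b1"] -= lr * gb1
        net["W2"] -= lr * gW2; net["b2"] -= lr * gb2
    return ((forward(net, X)[0] - Y) ** 2).mean()

# The teacher defines the task: its outputs are noise-free, so a student
# with >= 10 hidden units could in principle reach zero training error,
# i.e. the optimal training error is known.
teacher = init_mlp(n_in=5, n_hidden=10, n_out=1)
X = rng.uniform(-1, 1, (200, 5))
Y = forward(teacher, X)[0]

for h in (10, 25, 50):  # student sizes: exactly-sized and oversized
    student = init_mlp(5, h, 1)
    mse = train_bp(student, X, Y)
    print(f"hidden units = {h:3d}  final training MSE = {mse:.6f}")
```

If the observations reported above hold, one would expect the students to stop short of the known optimum of zero, and the oversized students to often reach training error at least as low as the exactly-sized one; outcomes will vary with the random seed and hyperparameters.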
