Overfitting and neural networks: conjugate gradient and backpropagation

Methods for controlling the bias/variance tradeoff typically assume that overfitting or overtraining is a global phenomenon. For multi-layer perceptron (MLP) neural networks, global parameters such as the training time, network size, or the amount of weight decay are commonly used to control the bias/variance tradeoff. However, the degree of overfitting can vary significantly throughout the input space of the model. We show that overselection of the degrees of freedom for an MLP trained with backpropagation can improve the approximation in regions of underfitting while not significantly overfitting in other regions, which can be a significant advantage over other models. Furthermore, we show that "better" learning algorithms such as conjugate gradient can in fact lead to worse generalization, because they can be more prone to creating varying degrees of overfitting in different regions of the input space. While experimental results cannot cover all practical situations, our results do help to explain common behavior that does not agree with theoretical expectations. Our results suggest one important reason for the relative success of MLPs, call into question common beliefs about training algorithms, overfitting, and optimal network size, suggest alternative guidelines for practical use, and help to direct future work.
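
As a minimal illustration of the setup described above (not the paper's experimental code), the following Python sketch trains the same small MLP on a toy one-dimensional regression task in two ways: with plain gradient descent stopped after a fixed number of epochs ("backpropagation" in the sense used here), and with SciPy's conjugate-gradient optimizer, which typically drives the training error much lower on the same network. It then compares squared error region by region across the input space rather than with a single global number. The sine target, network width, learning rate, epoch count, and region boundaries are all illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy 1-D regression task (illustrative, not the paper's benchmark):
# a noisy sine with sparse data, so different input regions fit differently.
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(40)

H = 20  # deliberately "oversized" hidden layer

def unpack(theta):
    W1 = theta[:H].reshape(H, 1)
    b1 = theta[H:2 * H]
    W2 = theta[2 * H:3 * H]
    b2 = theta[3 * H]
    return W1, b1, W2, b2

def forward(theta, X):
    W1, b1, W2, b2 = unpack(theta)
    a1 = np.tanh(X @ W1.T + b1)        # hidden activations
    return a1 @ W2 + b2, a1            # linear output

def loss(theta, X, y):
    yhat, _ = forward(theta, X)
    return np.mean((yhat - y) ** 2)

def grad(theta, X, y):
    # Backpropagated gradient of the mean squared error.
    W1, b1, W2, b2 = unpack(theta)
    yhat, a1 = forward(theta, X)
    n = len(y)
    d_out = 2.0 * (yhat - y) / n
    dW2 = a1.T @ d_out
    db2 = d_out.sum()
    dz1 = np.outer(d_out, W2) * (1 - a1 ** 2)   # through tanh
    dW1 = dz1.T @ X
    db1 = dz1.sum(axis=0)
    return np.concatenate([dW1.ravel(), db1, dW2, np.array([db2])])

theta0 = 0.1 * rng.standard_normal(3 * H + 1)

# Backpropagation with plain gradient descent, stopped after a fixed budget.
theta_bp = theta0.copy()
for _ in range(2000):
    theta_bp -= 0.05 * grad(theta_bp, X, y)

# Conjugate gradient on the same network and loss.
theta_cg = minimize(loss, theta0, args=(X, y), jac=grad, method="CG").x

# Compare generalization region by region rather than with one global number.
Xt = np.linspace(-3, 3, 300).reshape(-1, 1)
yt = np.sin(Xt[:, 0])
for name, th in [("backprop", theta_bp), ("conjugate gradient", theta_cg)]:
    err = (forward(th, Xt)[0] - yt) ** 2
    per_region = [err[(Xt[:, 0] >= a) & (Xt[:, 0] < a + 1.5)].mean()
                  for a in (-3.0, -1.5, 0.0, 1.5)]
    print(name, ["%.3f" % e for e in per_region])

Printing per-region error makes the point of the paper concrete: a single aggregate test error can hide the fact that one optimizer underfits in one region of the input space while overfitting in another.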