Overfitting and neural networks: conjugate gradient and backpropagation

Methods for controlling the bias/variance tradeoff typically assume that overfitting or overtraining is a global phenomenon. For multi-layer perceptron (MLP) neural networks, global parameters such as the training time, network size, or the amount of weight decay are commonly used to control the bias/variance tradeoff. However, the degree of overfitting can vary significantly throughout the input space of the model. We show that overselection of the degrees of freedom for an MLP trained with backpropagation can improve the approximation in regions of underfitting while not significantly overfitting in other regions, which can be a significant advantage over other models. Furthermore, we show that "better" learning algorithms such as conjugate gradient can in fact lead to worse generalization, because they can be more prone to creating varying degrees of overfitting in different regions of the input space. While experimental results cannot cover all practical situations, our results do help to explain common behavior that does not agree with theoretical expectations. Our results suggest one important reason for the relative success of MLPs, call into question common beliefs about training algorithms, overfitting, and optimal network size, suggest alternative guidelines for practical use, and help to direct future work.
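
As a minimal illustration of the setup described above (not the paper's experimental code), the following Python sketch trains the same small MLP on a toy one-dimensional regression task in two ways: with plain gradient descent stopped after a fixed number of epochs ("backpropagation" in the sense used here), and with SciPy's conjugate-gradient optimizer, which typically drives the training error much lower on the same network. It then compares squared error region by region across the input space rather than with a single global number. The sine target, network width, learning rate, epoch count, and region boundaries are all illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy 1-D regression task (illustrative, not the paper's benchmark):
# a noisy sine with sparse data, so different input regions fit differently.
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(40)

H = 20  # deliberately "oversized" hidden layer

def unpack(theta):
    W1 = theta[:H].reshape(H, 1)
    b1 = theta[H:2 * H]
    W2 = theta[2 * H:3 * H]
    b2 = theta[3 * H]
    return W1, b1, W2, b2

def forward(theta, X):
    W1, b1, W2, b2 = unpack(theta)
    a1 = np.tanh(X @ W1.T + b1)        # hidden activations
    return a1 @ W2 + b2, a1            # linear output

def loss(theta, X, y):
    yhat, _ = forward(theta, X)
    return np.mean((yhat - y) ** 2)

def grad(theta, X, y):
    # Backpropagated gradient of the mean squared error.
    W1, b1, W2, b2 = unpack(theta)
    yhat, a1 = forward(theta, X)
    n = len(y)
    d_out = 2.0 * (yhat - y) / n
    dW2 = a1.T @ d_out
    db2 = d_out.sum()
    dz1 = np.outer(d_out, W2) * (1 - a1 ** 2)   # through tanh
    dW1 = dz1.T @ X
    db1 = dz1.sum(axis=0)
    return np.concatenate([dW1.ravel(), db1, dW2, np.array([db2])])

theta0 = 0.1 * rng.standard_normal(3 * H + 1)

# Backpropagation with plain gradient descent, stopped after a fixed budget.
theta_bp = theta0.copy()
for _ in range(2000):
    theta_bp -= 0.05 * grad(theta_bp, X, y)

# Conjugate gradient on the same network and loss.
theta_cg = minimize(loss, theta0, args=(X, y), jac=grad, method="CG").x

# Compare generalization region by region rather than with one global number.
Xt = np.linspace(-3, 3, 300).reshape(-1, 1)
yt = np.sin(Xt[:, 0])
for name, th in [("backprop", theta_bp), ("conjugate gradient", theta_cg)]:
    err = (forward(th, Xt)[0] - yt) ** 2
    per_region = [err[(Xt[:, 0] >= a) & (Xt[:, 0] < a + 1.5)].mean()
                  for a in (-3.0, -1.5, 0.0, 1.5)]
    print(name, ["%.3f" % e for e in per_region])

Printing per-region error makes the point of the paper concrete: a single aggregate test error can hide the fact that one optimizer underfits in one region of the input space while overfitting in another.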