On the Problem in Model Selection of Neural Network Regression in Overrealizable Scenario

In statistical model selection for neural networks and radial basis functions in an overrealizable case, the problem of unidentifiability emerges. Because a model selection criterion is an unbiased estimator of the generalization error based on the training error, this article analyzes the expected training error and the expected generalization error of neural networks and radial basis functions in overrealizable cases and clarifies the difference from regular models, for which identifiability holds. As a special case of an overrealizable scenario, we assume a Gaussian noise sequence as training data. For least-squares estimation under this assumption, we first formulate the problem so that the calculation of the expected errors of unidentifiable networks is reduced to the calculation of the expectation of the supremum of the χ² process. Under this formulation, we give an upper bound on the expected training error and a lower bound on the expected generalization error, where the generalization error is measured at the set of training inputs. Furthermore, we give stochastic bounds on the training error and the generalization error. The upper bound on the expected training error is smaller than in regular models, and the lower bound on the expected generalization error is larger than in regular models. These results show that the degree of overfitting in neural networks and radial basis functions is higher than in regular models and, correspondingly, that their generalization capability is worse than that of regular models. They suffice to demonstrate a difference between neural networks and regular models for least-squares estimation in a simple situation and are a first step toward constructing a model selection criterion for the overrealizable case. Further important problems in this direction are also discussed.
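To make the mechanism concrete, the following is a minimal numerical sketch, not taken from the article itself: it fits a single Gaussian radial basis function unit to pure Gaussian noise, the overrealizable scenario in which the true function is zero. For one pre-fixed centre and width, the training sum of squared errors drops by roughly σ², as for one parameter of a regular model; optimising the centre and width over a grid amounts to taking the supremum of a χ² process over the candidate basis directions, and the average drop is noticeably larger, illustrating the stronger overfitting described above. The sample size, grid choices, and variable names are illustrative assumptions.

```python
# Sketch: overfitting of an unidentifiable RBF unit on pure Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
n, sigma, trials = 100, 1.0, 2000
x = np.linspace(0.0, 1.0, n)

# Candidate basis directions: unit-norm Gaussian bumps phi_{c,s}(x)
# over an assumed grid of centres c and widths s.
centres = np.linspace(0.0, 1.0, 50)
widths = np.array([0.02, 0.05, 0.1, 0.2])
basis = np.array([np.exp(-(x - c) ** 2 / (2.0 * s ** 2))
                  for c in centres for s in widths])
basis /= np.linalg.norm(basis, axis=1, keepdims=True)

fixed_drop, sup_drop = [], []
for _ in range(trials):
    y = sigma * rng.standard_normal(n)   # training targets: pure noise
    proj2 = (basis @ y) ** 2             # each ~ sigma^2 * chi^2_1 for a fixed direction
    fixed_drop.append(proj2[0])          # regular model: one pre-fixed basis direction
    sup_drop.append(proj2.max())         # unidentifiable model: best direction on the grid

print(f"mean SSE drop, fixed basis: {np.mean(fixed_drop):.2f} (~ sigma^2 = {sigma**2:.2f})")
print(f"mean SSE drop, tuned basis: {np.mean(sup_drop):.2f} (larger => stronger overfitting)")
```

Enlarging the grid (or letting the centre and width vary continuously, as in the article) only increases the second average, which is why the expected training error of the unidentifiable model falls below, and its expected generalization error rises above, the corresponding values for a regular model with the same number of parameters.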
