On the Relationship between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis Functions

Feedforward networks together with their training algorithms are a class of regression techniques that can be used to learn to perform some task from a set of examples. The question of generalization of network performance from a finite training set to unseen data is clearly of crucial importance. In this article we first show that the generalization error can be decomposed into two terms: the approximation error, due to the insufficient representational capacity of a finite-sized network, and the estimation error, due to insufficient information about the target function because of the finite number of samples. We then consider the problem of learning functions belonging to certain Sobolev spaces with Gaussian radial basis functions. Using the above-mentioned decomposition we bound the generalization error in terms of the number of basis functions and the number of examples. While the bound that we derive is specific to radial basis functions, a number of observations deriving from it apply to any approximation technique. Our result also sheds light on how to choose an appropriate network architecture for a particular problem and on the kinds of problems that can be effectively solved with finite resources, i.e., with a finite number of parameters and finite amounts of data.
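In symbols, the decomposition described above can be sketched as follows. The notation is introduced here only for illustration: f_0 denotes the target function, f_n the best approximant to f_0 within the class of networks with n Gaussian basis functions, and \hat{f}_{n,l} the network actually estimated from l examples. By the triangle inequality,

\| f_0 - \hat{f}_{n,l} \| \;\le\; \underbrace{\| f_0 - f_n \|}_{\text{approximation error}} \;+\; \underbrace{\| f_n - \hat{f}_{n,l} \|}_{\text{estimation error}},

and the resulting bound on the generalization error has, schematically, the form

\mathbb{E}\big[(f_0 - \hat{f}_{n,l})^2\big] \;\lesssim\; O\!\left(\frac{1}{n}\right) \;+\; O\!\left(\sqrt{\frac{n\,d\,\ln(nl)}{l}}\right),

where d is the input dimension: the first term shrinks as the network grows, while the second shrinks as the number of examples grows relative to the number of parameters. The precise statement, constants, and conditions are those given in the paper; the display above is only meant to indicate how hypothesis complexity (n) and sample complexity (l) enter the trade-off.

To make the two resources concrete, the following short sketch fits a Gaussian radial basis function network with n basis functions to l noisy samples of a one-dimensional target. It is an illustration of the setting only, not the estimator analyzed in the paper (in particular, the centers are fixed here rather than optimized, and all names and parameter values are chosen purely for the example):

import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)            # stand-in for the unknown target function f_0

l, n, sigma = 200, 15, 0.1                  # number of examples, basis functions, Gaussian width
x = rng.uniform(0.0, 1.0, size=l)           # sample points
y = target(x) + 0.1 * rng.normal(size=l)    # noisy observations

centers = np.linspace(0.0, 1.0, n)          # fixed centers (the paper treats centers as free parameters)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))   # l x n design matrix
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)                           # least-squares fit of the output weights

x_test = np.linspace(0.0, 1.0, 1000)
Phi_test = np.exp(-(x_test[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))
err = np.mean((Phi_test @ coef - target(x_test)) ** 2)
print(f"n={n} basis functions, l={l} examples, test MSE ~ {err:.4f}")

Varying n and l in this sketch reproduces the qualitative trade-off the bound describes: too few basis functions leaves a large approximation error regardless of the amount of data, while too many basis functions relative to the sample size inflates the estimation error.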
