Regularization Theory and Neural Networks

We have previously shown that regularization principles lead to approximation schemes which are equivalent to networks with one layer of hidden units, called Regularization Networks. In particular, standard smoothness functionals lead to a subclass of regularization networks, the well-known Radial Basis Functions approximation schemes. This paper shows that regularization networks encompass a much broader range of approximation schemes, including many of the popular general additive models and some of the neural networks. In particular, we introduce new classes of smoothness functionals that lead to different classes of basis functions. Additive splines as well as some tensor product splines can be obtained from appropriate classes of smoothness functionals. Furthermore, the same generalization that extends Radial Basis Functions (RBF) to Hyper Basis Functions (HBF) also leads from additive models to ridge approximation models, containing as special cases Breiman's hinge functions, some forms of Projection Pursuit Regression and several types of neural networks. We propose to use the term Generalized Regularization Networks for this broad class of approximation schemes that follow from an extension of regularization. In the probabilistic interpretation of regularization, the different classes of basis functions correspond to different classes of prior probabilities on the approximating function spaces, and therefore to different types of smoothness assumptions. In summary, different multilayer networks with one hidden layer, which we collectively call Generalized Regularization Networks, correspond to different classes of priors and associated smoothness functionals in a classical regularization principle.
Three broad classes are a) Radial Basis Functions, which can be generalized to Hyper Basis Functions, b) some tensor product splines, and c) additive splines, which can be generalized to schemes of the ridge approximation type, hinge functions and several perceptron-like neural networks with one hidden layer.
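As an illustration of the simplest case named above, the following is a minimal sketch of a regularization network with Gaussian radial basis functions centered on the data points. It assumes the standard form in which the approximant is f(x) = sum_i c_i G(x - x_i) and the coefficients solve the linear system (G + lambda I) c = y. The function names, the kernel width `sigma`, the regularization parameter `lam` and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Matrix of Gaussian basis values exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_regularization_network(X, y, lam=1e-3, sigma=1.0):
    """Solve (G + lam * I) c = y for the hidden-to-output weights c.

    One hidden unit is placed on each training point; lam and sigma
    are hypothetical settings chosen for this sketch.
    """
    G = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def predict(X_train, coeffs, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i c_i G(x - x_i) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ coeffs

# Toy one-dimensional problem (assumed data): noise-free samples of sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0])

c = fit_regularization_network(X, y, lam=1e-6, sigma=1.0)
X_test = np.linspace(-2.0, 2.0, 5)[:, None]
y_hat = predict(X, c, X_test)
```

Larger values of `lam` weight the smoothness term more heavily relative to the data term, so the fitted function follows the samples less closely. Other basis functions from the classes listed above would replace `gaussian_kernel` while leaving the one-hidden-layer structure unchanged.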
