On Different Facets of Regularization Theory

This review surveys regularization theory from different perspectives, emphasizing the principles of smoothness and simplicity. Using the tools of operator theory and Fourier analysis, the review shows that the solution of the classical Tikhonov regularization problem can be derived from the regularized functional defined by a linear differential (integral) operator in the spatial (Fourier) domain. State-of-the-art research relevant to regularization theory is reviewed, covering Occam's razor, minimum description length, Bayesian theory, pruning algorithms, informational (entropy) theory, statistical learning theory, and equivalent regularization. The universal principle of regularization in terms of Kolmogorov complexity is discussed. Finally, some prospective studies on regularization theory and beyond are suggested.
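
As an illustration of the Tikhonov problem mentioned above, the following is a minimal sketch rather than the review's own derivation. It assumes the standard regularization-network result: when the stabilizer's Green's function is a Gaussian G, the minimizer of sum_i (y_i - f(x_i))^2 + lam * ||Df||^2 takes the form f(x) = sum_i w_i G(x, x_i), with weights solving (G + lam*I) w = y. The function names, the kernel width, and the sample data are hypothetical choices made for this example.

```python
import math


def gaussian_kernel(x, c, width=1.0):
    """Gaussian Green's function G(x, c), assumed here as the stabilizer's kernel."""
    return math.exp(-((x - c) ** 2) / (2.0 * width ** 2))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_regularization_network(xs, ys, lam, width=1.0):
    """Return (w, f) where w solves (G + lam*I) w = y and f(x) = sum_i w_i G(x, x_i)."""
    n = len(xs)
    A = [
        [gaussian_kernel(xs[i], xs[j], width) + (lam if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    w = solve_linear(A, list(ys))

    def f(x):
        return sum(wi * gaussian_kernel(x, xi, width) for wi, xi in zip(w, xs))

    return w, f


if __name__ == "__main__":
    # Hypothetical noisy samples of a roughly linear trend.
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [0.1, 0.9, 2.2, 2.8, 4.1]
    for lam in (1e-8, 0.5):
        w, f = fit_regularization_network(xs, ys, lam)
        rss = sum((y - f(x)) ** 2 for x, y in zip(xs, ys))
        print(f"lam={lam}: residual sum of squares = {rss:.6f}")
```

In this sketch a very small lam nearly interpolates the data, while a larger lam trades data fidelity for smoothness; the residual at each sample equals lam times the corresponding weight, which follows directly from the linear system.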
