Paper No. 171: A unified framework for Regularization Networks and Support Vector Machines

Regularization Networks and Support Vector Machines are techniques for solving certain problems of learning from examples, in particular the regression problem of approximating a multivariate function from sparse data. We present both formulations in a unified framework, namely in the context of Vapnik's theory of statistical learning, which provides a general foundation for the learning problem, combining functional analysis and statistics.

Copyright © Massachusetts Institute of Technology, 1998. This report describes research done at the Center for Biological & Computational Learning and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. This research was sponsored by the National Science Foundation under contract No. IIS-9800032 and by the Office of Naval Research under contracts No. N00014-93-1-0385 and No. N00014-95-1-0600. Partial support was also provided by Daimler-Benz AG, Eastman Kodak, Siemens Corporate Research, Inc., ATR, and AT&T.
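To make the regression setting concrete, the sketch below implements the Regularization Network form of the solution: by the representer theorem, the minimizer of a regularized squared-loss functional is a kernel expansion f(x) = Σ_i c_i K(x, x_i), with coefficients obtained from a linear system. This is a minimal illustration, not code from the report; the Gaussian kernel, the (K + λℓI) scaling convention, the toy data, and all function names are assumptions made for the example.

```python
# Minimal sketch (illustrative assumptions, not the report's code):
# Regularization Network for regression, i.e. minimize
#   (1/l) * sum_i (y_i - f(x_i))^2 + lam * ||f||_K^2,
# whose minimizer is f(x) = sum_i c_i K(x, x_i) with (K + lam*l*I) c = y.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_regularization_network(X, y, lam=1e-3, sigma=1.0):
    """Solve (K + lam*l*I) c = y and return the estimated function."""
    l = len(X)
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + lam * l * np.eye(l), y)
    return lambda Xnew: gaussian_kernel(Xnew, X, sigma) @ c

# Toy usage: approximate a multivariate function from sparse, noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
y = np.sin(np.pi * X[:, 0]) * X[:, 1] + 0.05 * rng.standard_normal(30)
f = fit_regularization_network(X, y, lam=1e-3, sigma=0.5)
print(f(np.array([[0.5, 0.5]])))
```

Support Vector Machine regression fits the same framework by replacing the squared loss with Vapnik's epsilon-insensitive loss, which changes how the coefficients c are computed (a quadratic program rather than a linear system) but not the kernel-expansion form of the solution.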
