Generalization Performance of Regularization Networks and Support Vector Machines via Entropy Numbers of Compact Operators

We derive new bounds for the generalization error of kernel machines, such as support vector machines and related regularization networks, by obtaining new bounds on their covering numbers. The proofs use a viewpoint that is apparently novel in the field of statistical learning theory. The hypothesis class is described in terms of a linear operator mapping from a possibly infinite-dimensional unit ball in feature space into a finite-dimensional space. The covering numbers of the class are then determined via the entropy numbers of the operator. These numbers, which characterize the degree of compactness of the operator, can be bounded in terms of the eigenvalues of an integral operator induced by the kernel function used by the machine. As a consequence, we are able to theoretically explain the effect of the choice of kernel function on the generalization performance of support vector machines.
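
The key quantity the bounds depend on is the decay of the eigenvalues of the integral operator induced by the kernel. As a rough numerical illustration (not the paper's construction), the following sketch approximates these eigenvalues by the eigenvalues of a scaled Gram matrix on a random sample; the kernel, bandwidth, sample size, and function names are illustrative assumptions, chosen only to show how quickly the spectrum of a Gaussian kernel decays.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Draw m sample points; the scaled Gram matrix K / m approximates the
# integral operator (T_k f)(x) = integral k(x, y) f(y) dP(y), so its
# eigenvalues approximate the operator eigenvalues lambda_j appearing
# in the entropy-number bounds.
rng = np.random.default_rng(0)
m = 500
X = rng.uniform(-1.0, 1.0, size=(m, 1))

K = gaussian_kernel(X, X, sigma=0.5)
eigvals = np.linalg.eigvalsh(K / m)[::-1]   # sorted, largest first
eigvals = np.clip(eigvals, 0.0, None)       # remove tiny negative round-off

# Rapid (near-geometric) eigenvalue decay for the Gaussian kernel;
# faster decay translates into smaller entropy and covering numbers.
print(eigvals[:10])
```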
