Sample Based Generalization Bounds

It is known that the covering numbers of a function class on a double sample (of length 2m, where m is the number of points in the sample) can be used to bound the generalization performance of a classifier via a margin-based analysis. Traditionally this has been done using a "Sauer-like" relationship involving a combinatorial dimension such as the fat-shattering dimension. In this paper we show that one can use an analogous argument in terms of the observed covering numbers on a single m-sample (the actual observed data points). The significance of this is that for certain interesting classes of functions, such as support vector machines, the empirical covering numbers can be estimated quite well. We show how to do so in terms of the eigenvalues of the Gram matrix created from the data. These covering numbers can be much smaller than a priori bounds indicate in situations where the particular data received is "easy". The work can be considered an extension of previous results which provided generalization performance bounds in terms of the VC-dimension of the class of hypotheses restricted to the sample, with the considerable advantage that the covering numbers can be readily computed and are often small.
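As a rough illustration only (not the paper's actual bound), the sketch below computes the eigenvalue spectrum of a kernel Gram matrix built from a sample. The choice of RBF kernel, the parameter gamma, and the function names are illustrative assumptions; a rapidly decaying spectrum is the kind of "easy" data for which the empirical covering numbers, and hence the resulting generalization bound, would be small.

```python
import numpy as np

def rbf_gram_matrix(X, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for an RBF kernel
    # (illustrative choice; any positive definite kernel could be used instead).
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * sq_dists)

def gram_eigenvalues(X, gamma=1.0):
    # Eigenvalues of the symmetric positive semi-definite Gram matrix,
    # returned in decreasing order.
    K = rbf_gram_matrix(X, gamma)
    return np.linalg.eigvalsh(K)[::-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))    # hypothetical m = 100 sample in 5 dimensions
    lam = gram_eigenvalues(X, gamma=0.5)
    print(lam[:10])                  # fast decay suggests small empirical covering numbers
```

The point of the sketch is simply that the quantity the bound depends on is computable from the observed sample alone, in contrast with a priori combinatorial dimensions of the whole function class.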
