PAC-Bayesian Compression Bounds on the Prediction Error of Learning Algorithms for Classification

We consider bounds on the prediction error of classification algorithms based on sample compression. We refine the notion of a compression scheme to distinguish schemes that are invariant under permutation and repetition of the training examples from schemes that are not, which leads to different prediction error bounds. We also extend known results on compression to the case of non-zero empirical risk.

We provide bounds on the prediction error of classifiers returned by mistake-driven online learning algorithms by interpreting mistake bounds as bounds on the size of the algorithm's compression scheme. This leads to a bound on the prediction error of perceptron solutions that depends on the margin a support vector machine would achieve on the same training sample.

Furthermore, using the compression property, we derive bounds on the average prediction error of kernel classifiers in the PAC-Bayesian framework. These bounds assume a prior measure over the expansion coefficients in the data-dependent kernel expansion and bound the average prediction error uniformly over subsets of the space of expansion coefficients.
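
The link between mistake-driven learning and compression can be illustrated with a minimal sketch. It is not the paper's construction: the function names `perceptron_compression` and `compression_bound` are hypothetical, and the bound shown is the classical zero-empirical-risk sample compression bound for an unordered compression set of fixed size d, namely (ln C(m, d) + ln(1/delta)) / (m - d), rather than the refined bounds described above. The sketch records which training examples trigger perceptron updates; retraining on only those examples, in the same order, reproduces the same classifier, so they act as a compression set.

```python
import math


def perceptron_compression(X, y, max_epochs=100):
    """Run a plain perceptron and record which examples caused updates.

    Returns the weight vector and the list of indices (in order, with
    repetition) of the examples on which a mistake was made.
    """
    w = [0.0] * len(X[0])
    mistakes = []
    for _ in range(max_epochs):
        clean_pass = True
        for i, (x, label) in enumerate(zip(X, y)):
            score = sum(wj * xj for wj, xj in zip(w, x))
            if label * score <= 0:  # mistake (or zero margin): update
                w = [wj + label * xj for wj, xj in zip(w, x)]
                mistakes.append(i)
                clean_pass = False
        if clean_pass:  # consistent with the whole sample
            break
    return w, mistakes


def compression_bound(m, d, delta):
    """Classical bound for a consistent scheme with compression size d < m.

    Assumption: unordered compression set of fixed size d and zero
    empirical risk on the remaining m - d examples. Clipped to 1.
    """
    value = (math.log(math.comb(m, d)) + math.log(1.0 / delta)) / (m - d)
    return min(1.0, value)


# Toy, linearly separable data; the second feature is a constant bias term.
X = [[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
y = [1, 1, -1, -1]

w, mistakes = perceptron_compression(X, y)
compression_set = sorted(set(mistakes))  # distinct examples that caused updates

# Retraining on the compression set alone yields the same weight vector.
w_small, _ = perceptron_compression(
    [X[i] for i in compression_set], [y[i] for i in compression_set]
)
print(w, w_small, compression_set)
print(compression_bound(m=1000, d=10, delta=0.05))
```

On a sample this small the bound is vacuous; it only becomes informative when the number of distinct mistake examples is small relative to the sample size, which is where a margin-based mistake bound for the perceptron enters.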