PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers

We propose a PAC-Bayes theorem for the sample-compression setting, where each classifier is described by a compression subset of the training data and a message string of additional information. This setting, which is the appropriate one for describing many learning algorithms, strictly generalizes the usual data-independent setting, where classifiers are represented only by data-independent message strings (or parameters taken from a continuous set). The proposed PAC-Bayes theorem for the sample-compression setting reduces to the PAC-Bayes theorem of Seeger (2002) and Langford (2005) when the compression subset of each classifier vanishes. For posteriors having all their weight on a single sample-compressed classifier, the general risk bound reduces to a bound similar to the tight sample-compression bound proposed in Laviolette et al. (2005). Finally, we extend our results to the case where each sample-compressed classifier of a data-dependent ensemble may abstain from predicting a class label.
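
For context, the data-independent PAC-Bayes theorem of Seeger (2002) and Langford (2005), which the proposed sample-compression bound recovers when every compression subset vanishes, can be stated roughly as follows (the constant inside the logarithm varies slightly between statements; the form below follows Langford's tutorial): for any prior P over the classifier set and any delta in (0, 1], with probability at least 1 - delta over the draw of an m-example sample S, simultaneously for all posteriors Q,

\[
\mathrm{kl}\bigl(R_S(G_Q)\,\|\,R(G_Q)\bigr) \;\le\; \frac{\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}}{m},
\qquad
\mathrm{kl}(q\|p) \;=\; q\ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p},
\]

where R_S(G_Q) and R(G_Q) denote the empirical and true risks of the Gibbs classifier G_Q. Since the majority vote B_Q can err only where at least half of Q's weight errs, R(B_Q) <= 2 R(G_Q), which is how Gibbs-risk bounds of this kind transfer to majority votes. The minimal sketch below shows how such a kl bound is typically turned into a numerical upper bound on R(G_Q) by inverting the binary kl with bisection; the helper names and the example numbers are illustrative and are not taken from the paper.

import math

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pac_bayes_kl_bound(emp_gibbs_risk, kl_q_p, m, delta):
    """Invert kl(emp_gibbs_risk || r) <= (kl_q_p + ln((m+1)/delta)) / m
    by bisection on r, returning an upper bound on the true Gibbs risk."""
    rhs = (kl_q_p + math.log((m + 1) / delta)) / m
    lo, hi = emp_gibbs_risk, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if kl_bernoulli(emp_gibbs_risk, mid) > rhs:
            hi = mid
        else:
            lo = mid
    return hi

# Illustrative numbers only: empirical Gibbs risk 0.10, KL(Q||P) = 5 nats,
# m = 1000 training examples, confidence parameter delta = 0.05.
print(pac_bayes_kl_bound(0.10, 5.0, 1000, 0.05))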

[1] Manfred K. Warmuth, et al. Relating Data Compression and Learnability, 2003.

[2] R. Rivest. Learning Decision Lists, 1987, Machine Learning.

[3] Colin Campbell, et al. Bayes Point Machines, 2001, J. Mach. Learn. Res.

[4] David A. McAllester. Simplified PAC-Bayesian Margin Bounds, 2003, COLT.

[5] Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.

[6] Thomas Hofmann, et al. PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier, 2007.

[7] Matthias W. Seeger, et al. Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations, 2003.

[8] François Laviolette, et al. A PAC-Bayes approach to the Set Covering Machine, 2005, NIPS.

[9] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1997, EuroCOLT.

[10] Sally Floyd, et al. Sample compression, learnability, and the Vapnik-Chervonenkis dimension, 2004, Machine Learning.

[11] John Shawe-Taylor, et al. PAC-Bayesian Compression Bounds on the Prediction Error of Learning Algorithms for Classification, 2005, Machine Learning.

[12] Mario Marchand, et al. Learning with Decision Lists of Data-Dependent Features, 2005, J. Mach. Learn. Res.

[13] David A. McAllester. Some PAC-Bayesian Theorems, 1998, COLT '98.

[14] J. Langford. Tutorial on Practical Prediction Theory for Classification, 2005, J. Mach. Learn. Res.

[15] O. Catoni. A PAC-Bayesian approach to adaptive classification, 2004.

[16] D. L. Reilly, et al. A neural model for category learning, 1982, Biological Cybernetics.

[17] François Laviolette, et al. Margin-Sparsity Trade-Off for the Set Covering Machine, 2005, ECML.

[18] François Laviolette, et al. A PAC-Bayes Risk Bound for General Loss Functions, 2006, NIPS.

[19] John Shawe-Taylor, et al. The Set Covering Machine, 2003, J. Mach. Learn. Res.

[20] Simon Haykin, et al. An Approach to Adaptive Classification, 2001.

[21] John Shawe-Taylor, et al. PAC-Bayes & Margins, 2002, NIPS.

[22] David A. McAllester. PAC-Bayesian Stochastic Model Selection, 2003, Machine Learning.

[23] David G. Stork, et al. Pattern Classification, 1973.

[24] Amos Storkey, et al. Advances in Neural Information Processing Systems 20, 2007.

[25] François Laviolette, et al. PAC-Bayes risk bounds for sample-compressed Gibbs classifiers, 2005, ICML '05.

[26] Leslie G. Valiant. A theory of the learnable, 1984, STOC '84.