PAC-Bayesian inductive and transductive learning

We present here a PAC-Bayesian point of view on adaptive supervised classification. Using convex analysis, we show how to obtain local measures of the complexity of the classification model involving the relative entropy of posterior distributions with respect to Gibbs posterior measures. We discuss relative bounds, comparing two classification rules, showing how the margin assumption of Mammen and Tsybakov can be replaced by an empirical measure of the covariance structure of the classification model. We also show how to associate with any posterior distribution an {\em effective temperature} relating it to the Gibbs prior distribution with the same level of expected error rate, and how to estimate this effective temperature from the data, resulting in an estimator whose expected error rate converges adaptively at the best possible power of the sample size. We then introduce a PAC-Bayesian point of view on transductive learning and use it to improve on the known generalization bounds of Vapnik, extending them to the case where the sample is independent but not identically distributed. Finally, we briefly review the construction of Support Vector Machines and show how to derive generalization bounds for them, measuring the complexity either through the number of support vectors or through transductive or inductive margin estimates.
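For orientation, here is a minimal sketch of the central objects, in illustrative notation (the symbols $\pi$, $r$, $R$, $\beta$, $n$, $\delta$ are ours, not quoted from the text). Given a prior distribution $\pi$ on the parameter space $\Theta$, the empirical error rate $r(\theta)$ of the classification rule indexed by $\theta$ on an i.i.d. sample of size $n$, and its expectation $R(\theta)$, the Gibbs posterior at inverse temperature $\beta > 0$ is
\[
\pi_{-\beta r}(d\theta) \;=\; \frac{\exp\{-\beta\, r(\theta)\}\, \pi(d\theta)}{\int_{\Theta} \exp\{-\beta\, r(\theta')\}\, \pi(d\theta')},
\]
the unique minimizer of $\rho \mapsto \beta \int r \, d\rho + \mathcal{K}(\rho, \pi)$ over posterior distributions $\rho$, where $\mathcal{K}$ denotes the Kullback--Leibler divergence (relative entropy). A prototypical PAC-Bayesian deviation bound in the style of McAllester [24] then states that, with probability at least $1 - \delta$, simultaneously for all posterior distributions $\rho$,
\[
\int R \, d\rho \;\le\; \int r \, d\rho + \sqrt{\frac{\mathcal{K}(\rho, \pi) + \log\bigl(2\sqrt{n}/\delta\bigr)}{2n}}.
\]
The local complexity measures mentioned above replace $\mathcal{K}(\rho, \pi)$ with $\mathcal{K}(\rho, \pi_{-\beta r})$, the relative entropy with respect to a Gibbs posterior measure, and the effective temperature of a posterior $\rho$ is the value of $\beta$ at which the corresponding Gibbs measure achieves the same level of expected error rate as $\rho$.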

[1] Tong Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation, 2006, math/0702653.

[2] Tong Zhang, et al. Information-theoretic upper and lower bounds for statistical estimation, 2006, IEEE Transactions on Information Theory.

[3] S. Geer, et al. Square root penalty: Adaptation to the margin in classification and in edge estimation, 2005, math/0507422.

[4] Jean-Yves Audibert. Aggregated estimators and empirical complexity for least square regression, 2004.

[5] John Langford, et al. Computable Shell Decomposition Bounds, 2000, J. Mach. Learn. Res.

[6] J. Picard, et al. Statistical learning theory and stochastic optimization: École d'été de probabilités de Saint-Flour XXXI - 2001, 2004.

[7] Eric R. Ziegel, et al. The Elements of Statistical Learning, 2003, Technometrics.

[8] A. Tsybakov, et al. Optimal aggregation of classifiers in statistical learning, 2003.

[9] Peter A. Flach, et al. Proceedings of the 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, 2003.

[10] Manfred K. Warmuth, et al. Relating Data Compression and Learnability, 2003.

[11] Matthias W. Seeger, et al. PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification, 2003, J. Mach. Learn. Res.

[12] Nello Cristianini, et al. On the generalization of soft margin algorithms, 2002, IEEE Trans. Inf. Theory.

[13] P. Massart, et al. Gaussian model selection, 2001.

[14] Jean-Philippe Vert, et al. Adaptive context trees and text clustering, 2001, IEEE Trans. Inf. Theory.

[15] John Langford, et al. An Improved Predictive Accuracy Bound for Averaging Classifiers, 2001, ICML.

[16] Jean-Philippe Vert. Text Categorization Using Adaptive Context Trees, 2001, CICLing.

[17] O. Catoni. Laplace transform estimates and deviation inequalities, 2001.

[18] Olivier Catoni, et al. Data compression and adaptive histograms, 2002.

[19] Nello Cristianini, et al. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, 2000.

[20] S. Geer. Applications of empirical process theory, 2000.

[21] E. Mammen, et al. Smooth Discrimination Analysis, 1999.

[22] G. Blanchard. The “progressive mixture” estimator for regression trees, 1999.

[23] Yuhong Yang, et al. Information-theoretic determination of minimax rates of convergence, 1999.

[24] David A. McAllester. PAC-Bayesian model averaging, 1999, COLT '99.

[25] P. Massart, et al. Risk bounds for model selection via penalization, 1999.

[26] John Shawe-Taylor, et al. Structural Risk Minimization Over Data-Dependent Hierarchies, 1998, IEEE Trans. Inf. Theory.

[27] David A. McAllester. Some PAC-Bayesian Theorems, 1998, COLT '98.

[28] M. Habib. Probabilistic methods for algorithmic discrete mathematics, 1998.

[29] Vladimir Vapnik, et al. Statistical learning theory, 1998.

[30] P. Massart, et al. From Model Selection to Adaptive Estimation, 1997.

[31] Frans M. J. Willems, et al. Context weighting for general finite-context sources, 1996, IEEE Trans. Inf. Theory.

[32] Neri Merhav, et al. Hierarchical universal coding, 1996, IEEE Trans. Inf. Theory.

[33] Frans M. J. Willems, et al. The context-tree weighting method: basic properties, 1995, IEEE Trans. Inf. Theory.

[34] Manfred K. Warmuth. Proceedings of the seventh annual conference on Computational learning theory, 1994, COLT 1994.

[35] Noga Alon, et al. Scale-sensitive dimensions, uniform convergence, and learnability, 1993, Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science.

[36] A. Barron. Are Bayes Rules Consistent in Information?, 1987.