PAC-Bayesian Stochastic Model Selection

PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a “posterior distribution” on classifiers. This paper gives a PAC-Bayesian performance guarantee for stochastic model selection that is superior to analogous guarantees for deterministic model selection. The guarantee is stated in terms of the training error of the stochastic classifier and the KL-divergence of the posterior from the prior. It is shown that the posterior optimizing the performance guarantee is a Gibbs distribution. Simpler posterior distributions are also derived that have nearly optimal performance guarantees.
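As a rough illustration of the ideas above, the following sketch builds a Gibbs posterior over a small finite set of classifiers, computes the two quantities the guarantee is stated in terms of (the training error of the stochastic classifier and the KL-divergence of the posterior from the prior), and samples a classifier from the posterior. Everything concrete here is an assumption made for illustration and is not taken from the paper: the finite classifier set, the example prior and training errors, the function names, and in particular the inverse temperature `beta`.

```python
import math
import random


def gibbs_posterior(prior, train_errors, beta):
    """Return Q(h) proportional to P(h) * exp(-beta * train_error(h)).

    Assumes every prior weight is strictly positive. `beta` is a
    hypothetical inverse temperature chosen by the caller.
    """
    log_w = [math.log(p) - beta * e for p, e in zip(prior, train_errors)]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    z = sum(w)
    return [x / z for x in w]


def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions on the same finite set."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def stochastic_training_error(q, train_errors):
    """Expected training error when a classifier is drawn from Q."""
    return sum(qi * e for qi, e in zip(q, train_errors))


def sample_classifier(q, rng):
    """Draw the index of one classifier according to Q."""
    return rng.choices(range(len(q)), weights=q, k=1)[0]


# Illustrative values only: three classifiers with made-up statistics.
prior = [0.5, 0.3, 0.2]
train_errors = [0.30, 0.10, 0.25]

posterior = gibbs_posterior(prior, train_errors, beta=20.0)
kl = kl_divergence(posterior, prior)
err = stochastic_training_error(posterior, train_errors)
chosen = sample_classifier(posterior, random.Random(0))
```

With `beta = 0` the posterior equals the prior and the KL term vanishes; larger values of `beta` concentrate the posterior on classifiers with low training error at the cost of a larger KL term, which is the trade-off a bound of this kind balances.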
