Generalization bounds for averaged classifiers

We study a simple learning algorithm for binary classification. Instead of predicting with the single best hypothesis in the class, that is, the hypothesis that minimizes the training error, our algorithm predicts with an average of all hypotheses, weighted exponentially with respect to their training errors. We show that the prediction of this algorithm is much more stable than that of an algorithm that predicts with the single best hypothesis. By allowing the algorithm to abstain from predicting on some examples, we show that the predictions it does make are highly reliable. Finally, we show that the probability that the algorithm abstains is comparable to the generalization error of the best hypothesis in the class.
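To make the idea concrete, here is a minimal sketch of such an exponentially weighted averaged classifier over a finite hypothesis class. The parameter names (eta for the weighting temperature, delta for the abstention margin) and the interface are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def averaged_classifier(hypotheses, X_train, y_train, X_test, eta=1.0, delta=0.1):
    """Predict with an exponentially weighted average of hypotheses.

    hypotheses: list of callables mapping an example array to {-1, +1} labels.
    eta:   weighting temperature (hypothetical parameter name).
    delta: abstention margin (hypothetical parameter name).
    Returns predictions in {-1, +1, 0}, where 0 means "abstain".
    """
    n = len(y_train)
    # Empirical (training) error of each hypothesis.
    errs = np.array([np.mean(h(X_train) != y_train) for h in hypotheses])
    # Exponential weights: lower training error means exponentially more weight.
    w = np.exp(-eta * n * errs)
    w /= w.sum()
    # Weighted vote on each test example; a value in [-1, +1].
    votes = sum(wi * h(X_test) for wi, h in zip(w, hypotheses))
    preds = np.sign(votes)
    # Abstain whenever the weighted vote is too close to a tie.
    preds[np.abs(votes) < delta] = 0
    return preds
```

The key design point the abstract emphasizes is visible here: because the prediction is an average rather than an argmin, a small perturbation of the training set perturbs the weights smoothly instead of switching discontinuously between hypotheses, and the margin |votes| gives a natural confidence signal that drives the abstention rule.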
