On weak base hypotheses and their implications for boosting regression and classification

When studying the training error and the prediction error of boosting, it is often assumed that the hypotheses returned by the base learner are weakly accurate, that is, able to beat a random guesser by a certain positive margin. It has been an open question how large this margin can be: does it eventually vanish in the boosting process, or does it remain bounded away from zero? This question is crucial for the behavior of both the training error and the prediction error. In this paper we study this problem and show affirmatively that the improvement over the random guesser remains at least a positive amount for almost all possible sample realizations and for most of the commonly used base hypotheses. This has a number of implications for the prediction error; for example, boosting forever may not be good, and regularization may be necessary. We approach the problem by first considering an analog of AdaBoost for regression, where we study similar properties and find that, for good performance, one cannot hope to avoid regularization simply by adapting the boosting device to regression.
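As a rough illustration of the "edge" over random guessing discussed above, the following minimal Python sketch runs standard AdaBoost with decision stumps and prints, for each round t, the weighted error err_t of the selected base hypothesis and its edge gamma_t = 1/2 - err_t, the quantity whose positive lower bound is at issue. This is not the paper's construction; the synthetic data, the decision-stump base learner, and the number of rounds T are illustrative choices made here.

    import numpy as np

    # Illustrative synthetic data: labels in {-1, +1} determined by a linear rule.
    rng = np.random.default_rng(0)
    n = 200
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

    def best_stump(X, y, w):
        """Return the decision stump (feature, threshold, sign, weighted error)
        that minimizes the weighted training error under example weights w."""
        best = (0, 0.0, 1, np.inf)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] > thr, 1, -1)
                    err = np.sum(w[pred != y])
                    if err < best[3]:
                        best = (j, thr, s, err)
        return best

    w = np.full(n, 1.0 / n)   # uniform example weights at the start
    T = 10                    # number of boosting rounds (illustrative)
    for t in range(T):
        j, thr, s, err = best_stump(X, y, w)
        gamma = 0.5 - err                                  # edge over a random guesser
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        pred = s * np.where(X[:, j] > thr, 1, -1)
        w = w * np.exp(-alpha * y * pred)                  # upweight misclassified examples
        w = w / w.sum()
        print(f"round {t}: weighted error = {err:.3f}, edge gamma_t = {gamma:.3f}")

In a run of this sketch the weighted error typically creeps toward 1/2 as the weights concentrate on hard examples, so the printed edges shrink; the question addressed in the paper is whether such edges can be bounded below by a positive amount.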
