On the doubt about margin explanation of boosting

Margin theory provides one of the most popular explanations for the success of AdaBoost; its central tenet is that the margin is the key to characterizing the performance of AdaBoost. This theory has been very influential: for example, it has been used to argue that AdaBoost usually does not overfit, since it tends to enlarge the margin even after the training error reaches zero. The minimum margin bound was established for AdaBoost early on; however, Breiman (1999) [9] pointed out that maximizing the minimum margin does not necessarily lead to better generalization. Later, Reyzin and Schapire (2006) [37] emphasized that the margin distribution, rather than the minimum margin, is crucial to the performance of AdaBoost. In this paper, we first present the kth margin bound and study its relationship to previous results such as the minimum margin bound and the Emargin bound. We then improve the previous empirical Bernstein bounds (Audibert et al., 2009; Maurer and Pontil, 2009) [2,30], and, based on these results, we defend the margin-based explanation against Breiman's doubts by proving a new generalization error bound that considers exactly the same factors as Schapire et al. (1998) [39] yet is sharper than Breiman's (1999) [9] minimum margin bound. By incorporating factors such as the average margin and the variance, we further derive a generalization error bound that depends on the whole margin distribution. We also provide margin distribution bounds on the generalization error of voting classifiers in finite VC-dimension space.
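To make the quantities in the abstract concrete, here is a minimal sketch in Python/NumPy of the normalized voting margin, the kth margin (which recovers the minimum margin at k = 1), and a Maurer-Pontil-style empirical Bernstein upper confidence bound of the kind the paper refines. The function names, the rescaling choice, and the toy data are illustrative assumptions, not constructions from the paper.

```python
import numpy as np

def margins(H, alpha, y):
    """Normalized voting margins: y_i * sum_t alpha_t * h_t(x_i) / sum_t alpha_t.

    H     : (n, T) array of base-learner predictions in {-1, +1}
    alpha : (T,) array of nonnegative voting weights
    y     : (n,) array of true labels in {-1, +1}
    Each margin lies in [-1, 1]; a positive margin means the weighted vote is correct.
    """
    return y * (H @ alpha) / alpha.sum()

def kth_margin(m, k):
    """The kth smallest margin; k = 1 gives the minimum margin of Breiman's bound."""
    return np.partition(m, k - 1)[k - 1]

def empirical_bernstein_upper(z, delta):
    """Maurer-Pontil empirical Bernstein bound for i.i.d. z_i in [0, 1]:
    with probability at least 1 - delta,
      E[z] <= mean(z) + sqrt(2 * V_n * ln(2/delta) / n) + 7 * ln(2/delta) / (3 * (n - 1)),
    where V_n is the unbiased sample variance. The variance term is what makes
    the bound sensitive to the whole distribution rather than a single statistic.
    """
    z = np.asarray(z, dtype=float)
    n = len(z)
    log_term = np.log(2.0 / delta)
    return (z.mean()
            + np.sqrt(2.0 * z.var(ddof=1) * log_term / n)
            + 7.0 * log_term / (3.0 * (n - 1)))

# Toy usage: random "base learners" just to exercise the functions.
rng = np.random.default_rng(0)
H = rng.choice([-1, 1], size=(500, 50))
alpha = rng.random(50)
y = rng.choice([-1, 1], size=500)
m = margins(H, alpha, y)
print(kth_margin(m, 1), kth_margin(m, 10))   # minimum margin and 10th margin
# Rescale margins from [-1, 1] into [0, 1] (here via the margin loss (1 - m)/2)
# before applying the [0, 1]-valued bound.
print(empirical_bernstein_upper((1.0 - m) / 2.0, delta=0.05))
```

Under this scaling, a variance-sensitive bound can be strictly smaller than a Hoeffding-style bound whenever the margins are concentrated, which is the intuition behind replacing the minimum margin with the whole margin distribution.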

[1] Ron Kohavi, et al. An Empirical Comparison of Voting Classification Algorithms, 1999.

[2] David Haussler, et al. Occam's Razor, 1987, Inf. Process. Lett.

[3] C. McDiarmid. Concentration, 1998, in Probabilistic Methods for Algorithmic Discrete Mathematics.

[4] Thomas G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees, 2000.

[5] Chunhua Shen, et al. Boosting Through Optimization of Margin Distributions, 2009, IEEE Transactions on Neural Networks.

[6] Peter L. Bartlett, et al. AdaBoost is Consistent, 2006, J. Mach. Learn. Res.

[7] Vladimir Vapnik, et al. Statistical learning theory, 1998.

[8] J. Friedman. Additive logistic regression: A statistical view of boosting (special invited paper), 2000.

[9] Andreas Maurer, et al. Concentration inequalities for functions of independent variables, 2006, Random Struct. Algorithms.

[10] Cynthia Rudin, et al. The Rate of Convergence of AdaBoost, 2011, COLT.

[11] Thomas G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization, 2000, Machine Learning.

[12] H. Chernoff. A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations, 1952.

[13] Zhi-Hua Zhou, et al. A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin, 2011, J. Mach. Learn. Res.

[14] J. Ross Quinlan, et al. Bagging, Boosting, and C4.5, 1996, AAAI/IAAI, Vol. 1.

[15] Tony Jebara, et al. Variance Penalizing AdaBoost, 2011, NIPS.

[16] P. Bühlmann, et al. Boosting With the L2 Loss, 2003.

[17] Yoav Freund, et al. Experiments with a New Boosting Algorithm, 1996, ICML.

[18] Eric Bauer, et al. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants, 1999, Machine Learning.

[19] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.

[20] Dan Roth, et al. Margin Distribution and Learning, 2003, ICML.

[21] Rich Caruana, et al. An empirical comparison of supervised learning algorithms, 2006, ICML.

[22] Leo Breiman, et al. Prediction Games and Arcing Algorithms, 1999, Neural Computation.

[23] Michael I. Jordan, et al. Convexity, Classification, and Risk Bounds, 2006.

[24] László Györfi, et al. A Probabilistic Theory of Pattern Recognition, 1996, Stochastic Modelling and Applied Probability.

[25] Wei-Yin Loh, et al. Classification and regression trees, 2011, WIREs Data Mining Knowl. Discov.

[26] Xindong Wu, et al. The Top Ten Algorithms in Data Mining, 2009.

[27] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[28] V. Koltchinskii, et al. Empirical margin distributions and bounding the generalization error of combined classifiers, 2002, math/0405343.

[29] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization, 2003.

[30] Dale Schuurmans, et al. Boosting in the Limit: Maximizing the Margin of Learned Ensembles, 1998, AAAI/IAAI.

[31] M. Habib. Probabilistic methods for algorithmic discrete mathematics, 1998.

[32] Wenxin Jiang. Process consistency for AdaBoost, 2003.

[33] Tamás Linder, et al. Data-dependent margin-based generalization bounds for classification, 2001, J. Mach. Learn. Res.

[34] D. J. Newman, et al. UCI Repository of Machine Learning Databases, 1998.

[35] Colin McDiarmid, et al. Surveys in Combinatorics, 1989: On the method of bounded differences, 1989.

[36] A. E. Bostwick, et al. The Theory of Probabilities, 1896, Science.

[37] Yoav Freund, et al. Boosting the margin: A new explanation for the effectiveness of voting methods, 1997, ICML.

[38] Andreas Maurer. Concentration inequalities for functions of independent variables, 2006.

[39] W. Hoeffding. Probability inequalities for sums of bounded random variables, 1963.

[40] Massimiliano Pontil, et al. Empirical Bernstein Bounds and Sample-Variance Penalization, 2009, COLT.

[41] P. Bühlmann, et al. Boosting with the L2-loss: regression and classification, 2001.

[42] Zhi-Hua Zhou, et al. Ensemble Methods: Foundations and Algorithms, 2012.

[43] B. Yu, et al. Boosting with the L2-Loss: Regression and Classification, 2001.

[44] Gunnar Rätsch, et al. Soft Margins for AdaBoost, 2001, Machine Learning.

[45] Nello Cristianini, et al. Generalization Performance of Classifiers in Terms of Observed Covering Numbers, 1999, EuroCOLT.

[46] Csaba Szepesvári, et al. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, 2009, Theor. Comput. Sci.

[47] Robert E. Schapire, et al. How boosting the margin can also boost classifier complexity, 2006, ICML.

[48] David Mease, et al. Evidence Contrary to the Statistical View of Boosting, 2008, J. Mach. Learn. Res.

[49] P. Bickel, et al. Some Theory for Generalized Boosting Algorithms, 2006, J. Mach. Learn. Res.

[50] G. Lugosi, et al. On the Bayes-risk consistency of regularized boosting methods, 2003.

[51] Corinna Cortes, et al. Boosting Decision Trees, 1995, NIPS.

[52] Catherine Blake, et al. UCI Repository of machine learning databases, 1998.

[53] V. Koltchinskii, et al. Complexities of convex combinations and bounding the generalization error in classification, 2004, math/0405356.

[54] Norbert Sauer, et al. On the Density of Families of Sets, 1972, J. Comb. Theory A.

[55] L. Breiman. Some Infinity Theory for Predictor Ensembles, 2000.

[57] Peter L. Bartlett, et al. Boosting Algorithms as Gradient Descent, 1999, NIPS.