An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants

Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers on artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision-tree inducer (three variants) and a Naive-Bayes inducer. The purpose of the study is to improve our understanding of why and when these algorithms, which use perturbation, reweighting, and combination techniques, affect classification error. We provide a bias and variance decomposition of the error to show how the different methods and variants influence these two terms. This decomposition allowed us to determine that Bagging reduces the variance of unstable methods, while boosting methods (AdaBoost and Arc-x4) reduce both the bias and the variance of unstable methods but increase the variance of Naive-Bayes, which is very stable. We observed that Arc-x4 behaves differently from AdaBoost when reweighting is used instead of resampling, indicating a fundamental difference between the two. The voting variants, some of which are introduced in this paper, include pruning versus no pruning, use of probabilistic estimates, weight perturbation (Wagging), and backfitting of data. We found that Bagging improves when probabilistic estimates are used in conjunction with no pruning, and also when the data is backfit. We measured tree sizes and found an interesting positive correlation between the increase in average tree size across AdaBoost trials and its success in reducing error. We compare the mean-squared error of voting methods to that of non-voting methods and show that the voting methods lead to large and significant reductions in mean-squared error. Practical problems that arise in implementing boosting algorithms are explored, including numerical instabilities and underflows. Finally, we use scatterplots to show graphically how AdaBoost reweights instances, emphasizing not only “hard” areas but also outliers and noise.
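To make the contrast between the two boosting-style reweighting schemes concrete, the following is a minimal sketch (in Python with NumPy; not taken from the paper's implementation, and the function names are our own) of the per-instance weight updates: AdaBoost.M1 scales the weights of correctly classified instances by beta = epsilon / (1 - epsilon) and renormalizes, while Arc-x4 sets each weight proportional to 1 + m^4, where m is the number of times that instance has been misclassified so far.

    # Minimal sketch (assumed, not the paper's code): per-instance weight updates
    # for AdaBoost.M1 versus Arc-x4.
    import numpy as np

    def adaboost_m1_update(weights, misclassified):
        """One AdaBoost.M1 reweighting round: scale the weights of correctly
        classified instances by beta = eps / (1 - eps), then renormalize so the
        misclassified instances carry half of the total weight."""
        eps = weights[misclassified].sum()      # weighted error of this round
        if eps == 0.0 or eps >= 0.5:            # AdaBoost.M1 stops (or restarts) here
            return weights, eps
        beta = eps / (1.0 - eps)
        new_w = np.where(misclassified, weights, weights * beta)
        return new_w / new_w.sum(), eps         # renormalization; underflow risk lives here

    def arc_x4_weights(mistake_counts):
        """Arc-x4 weights: proportional to 1 + m_i**4, where m_i is the number of
        previous classifiers that misclassified instance i."""
        w = 1.0 + mistake_counts.astype(float) ** 4
        return w / w.sum()

    # Toy usage: five instances, the fourth one is repeatedly misclassified.
    w = np.full(5, 0.2)
    miss = np.array([False, False, False, True, False])
    w, eps = adaboost_m1_update(w, miss)
    print("AdaBoost.M1 weights after one round:", np.round(w, 3))
    print("Arc-x4 weights after 3 mistakes on one instance:",
          np.round(arc_x4_weights(np.array([0, 0, 0, 3, 0])), 3))

The repeated renormalization in both updates is also where the numerical instabilities and underflows mentioned above tend to surface once many instances have been driven to very small weights.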
