To Boost or not to Boost: On the Limits of Boosted Neural Networks

Boosting is a method for finding a highly accurate hypothesis by linearly combining many “weak” hypotheses, each of which may be only moderately accurate; in other words, boosting learns an ensemble of classifiers. While boosting has been shown to be very effective for decision trees, its impact on neural networks has not been extensively studied. We prove an important difference between sums of decision trees and sums of convolutional neural networks (CNNs): a sum of decision trees cannot be represented by a single decision tree with the same number of parameters, whereas a sum of CNNs can be represented by a single CNN. Next, using standard object recognition datasets, we experimentally verify the well-known result that a boosted ensemble of decision trees usually generalizes much better on test data than a single decision tree with the same number of parameters. In contrast, using the same datasets and boosting algorithms, our experiments show the opposite to be true for neural networks (both CNNs and multilayer perceptrons (MLPs)): a single neural network usually generalizes better than a boosted ensemble of smaller neural networks with the same total number of parameters.
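The comparison described for decision trees can be sketched in a few lines of code. The snippet below is an illustrative sketch only, not the authors' experimental code: it assumes scikit-learn, a small toy dataset (digits) rather than the paper's object recognition benchmarks, and arbitrary depths chosen so the ensemble and the single tree have roughly comparable parameter budgets.

```python
# Illustrative sketch (assumed setup, not the paper's experiments):
# a boosted ensemble of shallow decision trees vs. a single deeper tree
# with a roughly matched total number of nodes.
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Boosted ensemble: 32 weak trees of depth 3.
# (On scikit-learn < 1.2 the keyword is `base_estimator` instead of `estimator`.)
boosted = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=32,
    random_state=0,
).fit(X_train, y_train)

# Single tree: depth chosen so its node count roughly matches the ensemble's total.
single = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_train, y_train)

print("boosted ensemble test accuracy:", boosted.score(X_test, y_test))
print("single tree test accuracy:     ", single.score(X_test, y_test))
```

On most splits the boosted ensemble of shallow trees outperforms the single tree of comparable size, which is the decision-tree side of the contrast the abstract draws; the paper's claim is that repeating the analogous experiment with small CNNs or MLPs as weak learners tends to favor the single, larger network instead.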
