An Introduction to Boosting and Leveraging

We provide an introduction to theoretical and practical aspects of Boosting and ensemble learning, intended both as a reference for researchers in the field of Boosting and as a starting point for those seeking to enter this area of research. We begin with a short background on the learning-theoretic foundations of weak learners and their linear combinations. We then point out the useful connection between Boosting and the theory of optimization, which both facilitates the understanding of Boosting and enables the derivation of new Boosting algorithms applicable to a broad spectrum of problems. To increase the relevance of the paper to practitioners, we include remarks, pseudocode, "tricks of the trade", and algorithmic considerations where appropriate. Finally, we illustrate the usefulness of Boosting algorithms by surveying some existing applications. The main ideas are illustrated on the problem of binary classification, although several extensions are discussed.
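The central example of the abstract, Boosting for binary classification, can be made concrete with a minimal sketch of AdaBoost over decision stumps. This is an illustrative reconstruction of the classical algorithm, not code from the paper; the function and variable names are our own.

```python
import numpy as np

def train_adaboost(X, y, n_rounds=20):
    """Minimal AdaBoost with decision stumps, for labels y in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)            # uniform initial distribution over examples
    ensemble = []                      # list of (alpha, feature, threshold, sign)
    for _ in range(n_rounds):
        # Weak learner: exhaustively pick the stump (feature j, threshold t,
        # orientation s) with the smallest weighted error under w.
        best = None
        for j in range(d):
            for t in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] <= t, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        if err >= 0.5:                 # no stump beats random guessing: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        ensemble.append((alpha, j, t, s))
        pred = s * np.where(X[:, j] <= t, 1, -1)
        w *= np.exp(-alpha * y * pred) # exponential reweighting of examples
        w /= w.sum()                   # renormalize to a distribution
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted vote of the stumps."""
    f = np.zeros(len(X))
    for alpha, j, t, s in ensemble:
        f += alpha * s * np.where(X[:, j] <= t, 1, -1)
    return np.sign(f)
```

The exponential reweighting step is the point of contact with the optimization view discussed in the paper: each round can be read as a coordinate-descent step on an exponential loss over the linear combination of weak hypotheses.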
