Boosting as a Regularized Path to a Maximum Margin Classifier

In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion with an l1 constraint on the coefficient vector. This helps us understand the success of boosting with early stopping as regularized fitting of the loss criterion. For the two most commonly used criteria (exponential and binomial log-likelihood), we further show that as the constraint is relaxed, or equivalently as the boosting iterations proceed, the solution converges (in the separable case) to an "l1-optimal" separating hyperplane. We prove that this l1-optimal separating hyperplane has the property of maximizing the minimal l1-margin of the training data, as defined in the boosting literature. An interesting fundamental similarity between boosting and kernel support vector machines emerges: both can be described as methods for regularized optimization in high-dimensional predictor space, both use a computational trick to make the calculation practical, and both converge to margin-maximizing solutions. While this statement describes SVMs exactly, it applies to boosting only approximately.
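
To make the regularization picture concrete, the following is a minimal worked sketch (our illustration, not an excerpt from the paper) of the constrained problem the abstract refers to; the symbols $\beta$ for the coefficient vector, $h_j$ for the weak learners, and $C$ for the loss are generic notation introduced here for exposition.

% l1-constrained fitting of the boosting loss; boosting with early stopping
% approximately traces this path as the budget c grows.
\[
  \hat{\beta}(c) \;=\; \arg\min_{\|\beta\|_1 \le c}
  \sum_{i=1}^{n} C\bigl(y_i f_\beta(x_i)\bigr),
  \qquad f_\beta(x) = \sum_{j} \beta_j h_j(x),
\]
% with, for example, the exponential loss C(m) = e^{-m} or the binomial
% log-likelihood (logistic) loss C(m) = log(1 + e^{-m}).
%
% In the separable case, as c grows the normalized solution converges to the
% l1-margin-maximizing separating hyperplane:
\[
  \frac{\hat{\beta}(c)}{\|\hat{\beta}(c)\|_1}
  \;\longrightarrow\;
  \arg\max_{\|\beta\|_1 = 1} \; \min_{i} \; y_i f_\beta(x_i)
  \quad \text{as } c \to \infty.
\]

Under this view the number of boosting iterations plays the role of the budget c, which is why early stopping acts as l1 regularization; kernel SVMs solve the analogous problem with an l2 constraint, and their limit is the l2-margin-maximizing hyperplane.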

[1] I. Johnstone, et al. Wavelet Shrinkage: Asymptopia?, 1995.

[2] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[3] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1997, J. Comput. Syst. Sci.

[4] Yoav Freund, et al. Boosting the margin: A new explanation for the effectiveness of voting methods, 1997, ICML.

[5] Catherine Blake, et al. UCI Repository of machine learning databases, 1998.

[6] Peter L. Bartlett, et al. Boosting Algorithms as Gradient Descent, 1999, NIPS.

[7] Olvi L. Mangasarian, et al. Arbitrary-norm separating plane, 1999, Oper. Res. Lett.

[8] Leo Breiman, et al. Prediction Games and Arcing Algorithms, 1999, Neural Computation.

[9] Y. Freund, et al. Discussion of the paper "Additive Logistic Regression: A Statistical View of Boosting", 2000.

[10] J. Friedman. Greedy function approximation: A gradient boosting machine, 2001.

[11] Gunnar Rätsch, et al. On the Convergence of Leveraging, 2001, NIPS.

[12] Saharon Rosset, et al. Boosting Density Estimation, 2002, NIPS.

[13] Tong Zhang, et al. Sequential greedy approximation for certain convex optimization problems, 2003, IEEE Trans. Inf. Theory.

[14] Eric R. Ziegel, et al. The Elements of Statistical Learning, 2003, Technometrics.

[15] Gunnar Rätsch, et al. Soft Margins for AdaBoost, 2001, Machine Learning.

[16] B. Efron, et al. Least angle regression, 2004, math/0406456.

[17] Yoram Singer, et al. Logistic Regression, AdaBoost and Bregman Distances, 2000, Machine Learning.

[18] Gunnar Rätsch, et al. Efficient Margin Maximizing with Boosting, 2005, J. Mach. Learn. Res.