Boosting as entropy projection

We consider the AdaBoost procedure for boosting weak learners. In AdaBoost, a key step is choosing a new distribution on the training examples based on the old distribution and the mistakes made by the current weak hypothesis. We show how AdaBoost’s choice of the new distribution can be seen as an approximate solution to the following problem: find the new distribution that is closest to the old distribution, subject to the constraint that the new distribution is orthogonal to the mistake vector of the current weak hypothesis. The distance (or divergence) between distributions is measured by the relative entropy. Equivalently, AdaBoost approximately projects the distribution vector onto the hyperplane defined by the mistake vector. We show that this entropy-projection view of AdaBoost is dual to the usual view of AdaBoost as minimizing the normalization factors of the updated distributions.
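To make the correspondence concrete, the sketch below (a numerical illustration, not code from the paper) contrasts AdaBoost’s multiplicative update with an exact relative-entropy projection onto the hyperplane orthogonal to the mistake vector, computed by solving for the Lagrange multiplier numerically. The function names adaboost_update and entropy_projection, the toy data, and the use of NumPy/SciPy are assumptions made for illustration; for ±1-valued weak hypotheses the two updates coincide.

```python
import numpy as np
from scipy.optimize import brentq

def adaboost_update(d, u):
    """AdaBoost-style multiplicative update.

    d : current distribution over training examples (sums to 1)
    u : mistake vector, u_i = y_i * h(x_i) in {-1, +1}
        (+1 if the current weak hypothesis is correct on example i)
    Assumes 0 < weighted error < 1/2 so that alpha is finite.
    """
    eps = d[u == -1].sum()                 # weighted error of h under d
    alpha = 0.5 * np.log((1 - eps) / eps)  # AdaBoost's choice of coefficient
    d_new = d * np.exp(-alpha * u)
    return d_new / d_new.sum()             # divide by normalization factor Z

def entropy_projection(d, u):
    """Exact relative-entropy projection of d onto {p : sum_i p_i u_i = 0}.

    By Lagrangian duality the minimizer has the exponential form
    p_i proportional to d_i * exp(-a * u_i); we root-find the multiplier a
    so that the orthogonality constraint holds.
    """
    def edge(a):
        p = d * np.exp(-a * u)
        p /= p.sum()
        return p @ u                        # edge of h under candidate p
    a = brentq(edge, -50.0, 50.0)           # edge(a) is decreasing in a
    p = d * np.exp(-a * u)
    return p / p.sum()

# Toy example: 5 training points, weak hypothesis wrong on two of them.
d0 = np.full(5, 0.2)
u = np.array([+1, +1, +1, -1, -1])
print(adaboost_update(d0, u))      # wrong examples get weight up, correct down
print(entropy_projection(d0, u))   # same distribution for a ±1-valued hypothesis
```

Under the new distribution the current weak hypothesis has zero edge (weighted error exactly 1/2), which is precisely the orthogonality constraint described in the abstract.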
