On the Rate of Convergence of Regularized Boosting Classifiers

A regularized boosting method is introduced, in which regularization is achieved through a penalization function. It is shown through oracle inequalities that this method is model adaptive. The rate of convergence of the probability of misclassification is investigated. It is shown that, for quite a large class of distributions, the probability of error converges to the Bayes risk at a rate faster than $n^{-(V+2)/(4(V+1))}$, where $V$ is the VC dimension of the "base" class whose elements are combined by boosting to obtain an aggregated classifier. The dimension-independent nature of the rates may partially explain the good behavior of these methods in practical problems. Under Tsybakov's noise condition the rate of convergence is even faster. We investigate the conditions necessary to obtain such rates for different base classes. The special case of boosting using decision stumps is studied in detail. We characterize the class of classifiers realizable by aggregating decision stumps. It is shown that some versions of boosting work especially well in high-dimensional logistic additive models. It appears that adding limited label noise to the training data may, in certain cases, improve convergence, as has also been suggested by other authors.
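To make the setting concrete, the sketch below illustrates, in Python, one simple way of aggregating decision stumps of the kind discussed above: a greedy, Frank-Wolfe-style minimization of the exponential surrogate loss over the convex hull of stumps scaled by a smoothing parameter. This is only an illustrative toy under our own assumptions, not the estimator analyzed in the paper; the function names (`fit_stump`, `regularized_boost`), the choice of surrogate loss, the step-size schedule, and the parameters `lam` and `n_rounds` are all introduced here for the example.

```python
# Illustrative sketch (not the paper's estimator): greedy boosting of decision
# stumps under the exponential surrogate loss, with the aggregated score kept
# as lam times a convex combination of stumps -- one simple way to mimic
# l1-type regularization of the combination weights.
import numpy as np

def fit_stump(X, y, w):
    """Return (feature, threshold, sign) of the weighted-error-minimizing stump."""
    n, d = X.shape
    best = (0, 0.0, 1, np.inf)          # feature, threshold, sign, weighted error
    for j in range(d):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] <= t, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (j, t, s, err)
    return best[:3]

def stump_predict(X, stump):
    j, t, s = stump
    return s * np.where(X[:, j] <= t, 1, -1)

def regularized_boost(X, y, lam=1.0, n_rounds=50):
    """Greedily build f = lam * sum_k alpha_k h_k with sum_k alpha_k = 1."""
    n = X.shape[0]
    f = np.zeros(n)                      # current aggregated score on the sample
    stumps, alphas = [], []
    for k in range(1, n_rounds + 1):
        w = np.exp(-y * f)               # exponential-loss weights
        w /= w.sum()
        stump = fit_stump(X, y, w)       # weighted-error minimization = linear step
        h = stump_predict(X, stump)
        alpha = 2.0 / (k + 1)            # Frank-Wolfe-style step keeps weights convex
        f = (1 - alpha) * f + alpha * lam * h
        stumps.append(stump)
        alphas = [(1 - alpha) * a for a in alphas] + [alpha]
    predict = lambda Xnew: np.sign(
        lam * sum(a * stump_predict(Xnew, s) for a, s in zip(alphas, stumps))
    )
    return predict

# Toy usage: labels in {-1, +1} determined by an additive rule on two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
clf = regularized_boost(X, y, lam=2.0, n_rounds=30)
print("training error:", np.mean(clf(X) != y))
```

The step size `2/(k+1)` guarantees that the stump weights remain a convex combination at every round, so the size of the aggregated class is controlled entirely by the single scale parameter `lam`, loosely mirroring the role of the penalization/regularization parameter in the analysis.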
