SVM-Boosting based on Markov resampling: Theory and algorithm

In this article, we introduce the idea of Markov resampling for Boosting methods. We first prove that the Boosting algorithm with a general convex loss function, trained on uniformly ergodic Markov chain (u.e.M.c.) examples, is consistent, and we establish its fast convergence rate. We then apply Boosting based on Markov resampling to the Support Vector Machine (SVM) and introduce two new resampling-based Boosting algorithms: SVM-Boosting based on Markov resampling (SVM-BM) and improved SVM-Boosting based on Markov resampling (ISVM-BM). In contrast to SVM-BM, ISVM-BM uses the support vectors to compute the weights of the base classifiers. Numerical studies on benchmark datasets show that the two proposed resampling-based SVM Boosting algorithms with linear base classifiers achieve smaller misclassification rates and less total sampling and training time than three classical AdaBoost algorithms: Gentle AdaBoost, Real AdaBoost, and Modest AdaBoost. In addition, we compare the proposed SVM-BM algorithm with the widely used and efficient gradient Boosting algorithm XGBoost (eXtreme Gradient Boosting) and with SVM-AdaBoost, and we present some useful discussions of the technical parameters.
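To make the general idea concrete, the following is a minimal Python sketch of a boosting loop in which each round draws its training subset by a Markov-chain walk over the data and fits a linear SVM as the base classifier. The abstract does not specify the resampling rule, acceptance probabilities, or base-classifier weighting used by SVM-BM/ISVM-BM, so the acceptance rule and the AdaBoost-style weights below are illustrative assumptions, not the authors' algorithm; all function names and parameters (markov_resample, q, T, m) are hypothetical.

# Illustrative sketch only: Markov-resampled boosting with linear-SVM base learners.
# Assumes binary labels coded as -1/+1.
import numpy as np
from sklearn.svm import LinearSVC

def markov_resample(X, y, clf, m, rng, q=0.5):
    """Draw m indices by a simple Markov-chain walk over the training set:
    a candidate example is always accepted if the current classifier
    misclassifies it, and accepted with probability q otherwise
    (an assumed, illustrative acceptance rule)."""
    n = X.shape[0]
    idx = [int(rng.integers(n))]
    while len(idx) < m:
        cand = int(rng.integers(n))
        wrong = clf.predict(X[cand:cand + 1])[0] != y[cand]
        if wrong or rng.random() < q:
            idx.append(cand)
    return np.asarray(idx)

def svm_boost_markov(X, y, T=10, m=200, seed=0):
    """Train T linear-SVM base classifiers on Markov-resampled subsets and
    combine them with AdaBoost-style weights (assumed, for illustration)."""
    rng = np.random.default_rng(seed)
    classifiers, alphas = [], []
    clf = LinearSVC().fit(X, y)          # initial classifier on the full sample
    for _ in range(T):
        idx = markov_resample(X, y, clf, m, rng)
        clf = LinearSVC().fit(X[idx], y[idx])
        err = np.clip(np.mean(clf.predict(X) != y), 1e-6, 1 - 1e-6)
        alphas.append(0.5 * np.log((1 - err) / err))   # AdaBoost-style weight
        classifiers.append(clf)
    def predict(X_new):
        votes = sum(a * c.predict(X_new) for a, c in zip(alphas, classifiers))
        return np.sign(votes)
    return predict

Under this sketch, ISVM-BM would differ only in how the weights alphas are computed: as stated in the abstract, it derives them from the support vectors of each base classifier rather than from a global error estimate.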
