Infinite Ensemble Learning with Support Vector Machines

Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of base hypotheses. However, existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. It is not clear whether we should construct an ensemble classifier with a larger, or even infinite, number of hypotheses. In addition, constructing an infinite ensemble is itself a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble, and can be used to construct new kernels for SVM as well as to interpret some existing ones. We demonstrate the framework with a concrete application, the stump kernel, which embodies infinitely many decision stumps. The stump kernel is simple yet powerful. Experimental results show that SVM with the stump kernel is usually superior to boosting, even with noisy data.
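
To make the idea concrete, below is a minimal sketch of plugging a stump-style kernel into a standard SVM solver. It assumes scikit-learn's SVC with a callable kernel, takes the kernel to be of the ℓ1-distance form K(x, x') = Δ − ½‖x − x'‖₁ associated with aggregating infinitely many decision stumps, and picks the offset Δ heuristically from the training data's feature ranges; the exact constants and scaling used in the paper may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def make_stump_kernel(delta):
    """Return a kernel callable computing K(x, x') = delta - 0.5 * ||x - x'||_1."""
    def stump_kernel(X, Y):
        # Pairwise L1 distances between the rows of X and the rows of Y.
        dists = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=2)
        return delta - 0.5 * dists
    return stump_kernel


X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fix the offset from the training data's feature ranges (a heuristic choice
# for this sketch; the paper defines its own constant for the stump kernel).
delta = 0.5 * (X_tr.max(axis=0) - X_tr.min(axis=0)).sum()

clf = SVC(kernel=make_stump_kernel(delta), C=1.0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

Keeping Δ fixed between training and prediction matters: the same callable (with the same offset) is used when the trained model computes kernel values against the support vectors at test time.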
