Kernel Matching Pursuit

Matching pursuit algorithms learn a function as a weighted sum of basis functions, sequentially appending functions to an initially empty basis so as to approximate a target function in the least-squares sense. We show how matching pursuit can be extended to non-squared error loss functions, and how it can be used to build kernel-based solutions to machine learning problems while keeping control of the sparsity of the solution. We also present a version of the algorithm that, at each step, makes an optimal choice of both the next basis function and the weights of all previously chosen basis functions. Finally, we discuss links to boosting algorithms and RBF training procedures, and report an extensive experimental comparison with SVMs on classification tasks, showing comparable results with typically much sparser models.
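
To make the greedy procedure concrete, the following is a minimal sketch in Python, not the authors' implementation: it assumes a Gaussian kernel, squared-error loss, and the back-fitting variant in which the weights of all previously chosen basis functions are re-optimized by least squares after each selection. All function and variable names (gaussian_kernel, kernel_matching_pursuit, sigma, n_basis) are illustrative choices, not taken from the paper.

```python
# Minimal kernel matching pursuit sketch for regression (illustrative only).
import numpy as np

def gaussian_kernel(X, centers, sigma=1.0):
    # Gaussian (RBF) kernel between rows of X and the chosen centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_matching_pursuit(X, y, n_basis=10, sigma=1.0):
    """Greedily pick training points as kernel centers to approximate y in the
    least-squares sense; returns the selected indices and their weights."""
    K = gaussian_kernel(X, X, sigma)      # all candidate basis functions, (n, n)
    selected = []                         # indices of chosen kernel centers
    residual = y.copy()                   # residual of the current fit
    weights = np.zeros(0)

    for _ in range(n_basis):
        # Score each candidate by the squared-error reduction it would give
        # on its own: (g . r)^2 / ||g||^2 for each column g of K.
        scores = (K.T @ residual) ** 2 / (K ** 2).sum(axis=0)
        scores[selected] = -np.inf        # do not reuse a center
        best = int(np.argmax(scores))
        selected.append(best)

        # Back-fitting step: jointly re-optimize the weights of *all*
        # selected basis functions by ordinary least squares.
        K_sel = K[:, selected]
        weights, *_ = np.linalg.lstsq(K_sel, y, rcond=None)
        residual = y - K_sel @ weights

    return np.array(selected), weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
    idx, w = kernel_matching_pursuit(X, y, n_basis=15, sigma=0.5)
    print("selected centers:", idx)
```

Stopping after a fixed number of basis functions (or when the residual norm falls below a threshold) is what gives direct control over the sparsity of the resulting kernel expansion.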
