Some Greedy Learning Algorithms for Sparse Regression and Classification with Mercer Kernels

We present greedy learning algorithms for building sparse nonlinear regression and classification models from observational data using Mercer kernels. Our objective is to develop efficient numerical schemes for reducing the training and runtime complexities of kernel-based algorithms applied to large datasets. In the spirit of Natarajan's greedy algorithm (Natarajan, 1995), we iteratively minimize the L2 loss function, either until a specified constraint on the sparsity of the final model is satisfied or until a specified stopping criterion is met. We discuss various greedy criteria for basis selection as well as numerical schemes for improving robustness and computational efficiency. Subsequently, algorithms based on residual minimization and thin QR factorization are presented for constructing sparse regression and classification models. During incremental model construction, the algorithms are terminated using model selection principles such as the minimum description length (MDL) and Akaike's information criterion (AIC). Finally, experimental results on benchmark data are presented to demonstrate the competitiveness of the algorithms developed in this paper.
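As a rough illustration of the kind of procedure the abstract describes, the sketch below performs greedy forward selection of Gaussian-kernel basis functions by residual minimization, maintains a thin QR factorization through Gram-Schmidt column updates, and stops with an AIC-style penalty. The kernel choice, the function names (`gaussian_kernel`, `greedy_sparse_kernel_regression`), and the particular form of the AIC term are illustrative assumptions and not the paper's exact algorithm.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) Mercer kernel between row-vector sets X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def greedy_sparse_kernel_regression(X, y, gamma=1.0, max_basis=20):
    """Greedy forward selection of kernel basis functions by residual
    minimization, with an incrementally updated thin QR factorization
    and an AIC-style stopping rule (illustrative sketch)."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)          # candidate basis columns k(., x_j)
    Q = np.zeros((n, 0))                      # thin Q factor of selected columns
    R = np.zeros((0, 0))                      # corresponding R factor
    selected, residual = [], y.copy()
    best_aic = np.inf

    for m in range(1, max_basis + 1):
        # pick the unselected column most correlated with the residual
        scores = np.abs(K.T @ residual) / (np.linalg.norm(K, axis=0) + 1e-12)
        scores[selected] = -np.inf
        j = int(np.argmax(scores))

        # Gram-Schmidt update of the thin QR factorization with column k_j
        k_j = K[:, j]
        r_new = Q.T @ k_j
        q_new = k_j - Q @ r_new
        rho = np.linalg.norm(q_new)
        if rho < 1e-10:                        # nearly dependent column: stop
            break
        q_new /= rho
        Q = np.column_stack([Q, q_new])
        R = np.block([[R, r_new[:, None]],
                      [np.zeros((1, R.shape[1])), np.array([[rho]])]])
        selected.append(j)

        # residual after projecting y onto the span of the selected columns
        residual = y - Q @ (Q.T @ y)

        # AIC-style criterion: n*log(RSS/n) + 2*m  (one possible form)
        rss = float(residual @ residual)
        aic = n * np.log(max(rss / n, 1e-300)) + 2 * m
        if aic > best_aic:                     # criterion worsened: undo and stop
            selected.pop()
            Q, R = Q[:, :-1], R[:-1, :-1]
            break
        best_aic = aic

    # recover expansion coefficients for the selected kernel columns
    coef = np.linalg.solve(R, Q.T @ y) if selected else np.zeros(0)
    return np.array(selected), coef
```

Under these assumptions, `selected, coef = greedy_sparse_kernel_regression(X, y)` returns the indices of the chosen training points and the corresponding kernel expansion coefficients, and a prediction at new inputs `Z` is `gaussian_kernel(Z, X[selected], gamma) @ coef`.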

[1]  Brian D. Ripley,et al.  Pattern Recognition and Neural Networks , 1996 .

[2]  Bernhard Schölkopf,et al.  New Support Vector Algorithms , 2000, Neural Computation.

[3]  Alexander J. Smola,et al.  Sparse Greedy Gaussian Process Regression , 2000, NIPS.

[4]  Pascal Vincent,et al.  Kernel Matching Pursuit , 2002, Machine Learning.

[5]  N. Sugiura  Further analysis of the data by Akaike's information criterion and the finite corrections , 1978 .

[6]  Y. Freund,et al.  Discussion of the paper "Additive Logistic Regression: A Statistical View of Boosting" , 2000 .

[7]  J. Friedman  Greedy function approximation: A gradient boosting machine , 2001 .

[8]  Lothar Reichel,et al.  Algorithm 686: FORTRAN subroutines for updating the QR decomposition , 1990, TOMS.

[9]  Michael E. Tipping  Sparse Bayesian Learning and the Relevance Vector Machine , 2001, J. Mach. Learn. Res.

[10]  Edmond Chow,et al.  Approximate Inverse Preconditioners via Sparse-Sparse Iterations , 1998, SIAM J. Sci. Comput..

[11]  Balas K. Natarajan,et al.  Sparse Approximate Solutions to Linear Systems , 1995, SIAM J. Comput..

[12]  B. Natarajan On Learning Functions from Noise-Free and Noisy Samples via Occam's Razor , 1999, SIAM J. Comput..

[13]  Federico Girosi,et al.  An Equivalence Between Sparse Approximation and Support Vector Machines , 1998, Neural Computation.

[14]  Holger Wendland,et al.  Adaptive greedy techniques for approximate solution of large RBF systems , 2000, Numerical Algorithms.

[15]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[16]  Edmond Chow,et al.  Approximate Inverse Techniques for Block-Partitioned Matrices , 1997, SIAM J. Sci. Comput..

[17]  Yoram Bresler,et al.  On the Optimality of the Backward Greedy Algorithm for the Subset Selection Problem , 2000, SIAM J. Matrix Anal. Appl..

[18]  Michael A. Saunders,et al.  Atomic Decomposition by Basis Pursuit , 1998, SIAM J. Sci. Comput..

[19]  Bernhard Schölkopf,et al.  Sparse Greedy Matrix Approximation for Machine Learning , 2000, International Conference on Machine Learning.

[20]  Matthias W. Seeger,et al.  Using the Nyström Method to Speed Up Kernel Machines , 2000, NIPS.

[21]  B.D. Rao,et al.  Comparison of basis selection methods , 1996, Conference Record of The Thirtieth Asilomar Conference on Signals, Systems and Computers.

[22]  G. Stewart,et al.  Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization , 1976 .

[23]  B. Yu,et al.  Boosting with the L2 Loss: Regression and Classification , 2001 .

[24]  P. Bühlmann,et al.  Boosting With the L2 Loss , 2003 .

[25]  Vladimir Vapnik,et al.  The Nature of Statistical Learning Theory , 1995 .

[26]  Tong Zhang,et al.  Some Sparse Approximation Bounds for Regression Problems , 2001, International Conference on Machine Learning.

[27]  Shang-Liang Chen,et al.  Orthogonal least squares learning algorithm for radial basis function networks , 1991, IEEE Trans. Neural Networks.

[28]  A. Mees,et al.  On selecting models for nonlinear time series , 1995 .

[29]  Marcus J. Grote,et al.  Parallel Preconditioning with Sparse Approximate Inverses , 1997, SIAM J. Sci. Comput..

[30]  Bin Yu,et al.  Model Selection and the Principle of Minimum Description Length , 2001 .

[31]  Stéphane Mallat,et al.  Matching pursuits with time-frequency dictionaries , 1993, IEEE Trans. Signal Process..

[32]  Tomaso A. Poggio,et al.  Regularization Networks and Support Vector Machines , 2000, Adv. Comput. Math..

[33]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[34]  Bernhard Schölkopf,et al.  Comparing support vector machines with Gaussian kernels to radial basis function classifiers , 1997, IEEE Trans. Signal Process..