L2 Regularization for Learning Kernels

The choice of kernel is critical to the success of many learning algorithms, but it is typically left to the user. Instead, the training data can be used to learn the kernel by selecting it from a given family, such as that of non-negative linear combinations of p base kernels, constrained by a trace condition or by L1 regularization. This paper studies the problem of learning kernels from the same family but with L2 regularization instead, in the regression setting, analyzing it in the context of kernel ridge regression. We derive the form of the solution of the optimization problem and give an efficient iterative algorithm for computing it. We present a novel theoretical analysis of the problem based on stability and give learning bounds for orthogonal kernels that contain only an additive term O(√p/m) when compared to the standard kernel ridge regression stability bound. We also report experimental results indicating that L1 regularization can lead to modest improvements for a small number of kernels, but to performance degradation in larger-scale settings. In contrast, L2 regularization never degrades performance and in fact achieves significant improvements with a large number of kernels.
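The iterative solution described above can be sketched as an alternating scheme: fix the kernel weights, solve the kernel ridge regression dual for the coefficient vector α, then update the weights in closed form from α and project onto the feasible set. The function name `learn_kernel_l2`, the feasible starting point, and the parameter defaults below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def learn_kernel_l2(base_kernels, y, lam=1.0, radius=1.0, mu0=None, n_iter=50):
    """Sketch of an alternating scheme for L2-regularized kernel learning.

    Given p base Gram matrices K_1..K_p (each m x m, PSD) and targets y,
    find non-negative weights mu with ||mu - mu0||_2 <= radius by alternating:
      1. alpha = (sum_k mu_k K_k + lam I)^{-1} y   (kernel ridge regression dual)
      2. mu    = max(mu0 + radius * v / ||v||, 0), with v_k = alpha' K_k alpha
    """
    p, m = len(base_kernels), len(y)
    mu0 = np.zeros(p) if mu0 is None else np.asarray(mu0, dtype=float)
    # Feasible, strictly positive starting point inside the L2 ball.
    mu = mu0 + radius / np.sqrt(p)
    alpha = np.zeros(m)
    for _ in range(n_iter):
        # Combined kernel for the current weights.
        K = sum(w * Kk for w, Kk in zip(mu, base_kernels))
        # Step 1: closed-form kernel ridge regression solution.
        alpha = np.linalg.solve(K + lam * np.eye(m), y)
        # Step 2: closed-form weight update, projected to mu >= 0.
        v = np.array([alpha @ Kk @ alpha for Kk in base_kernels])
        norm = np.linalg.norm(v)
        if norm > 0:
            mu = np.maximum(mu0 + radius * v / norm, 0.0)
    return mu, alpha
```

Since each base kernel matrix is positive semidefinite, v_k = α'K_kα ≥ 0, so the projection onto the non-negative orthant rarely bites in practice; the ridge term λI keeps the linear system well-conditioned even when the combined kernel is singular.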
