Voted Kernel Regularization

This paper presents an algorithm, Voted Kernel Regularization, that provides the flexibility of using potentially very complex kernel functions, such as polynomial kernels of much higher degree, while benefiting from strong learning guarantees. The success of our algorithm arises from bounds we derive, which suggest a new regularization penalty expressed in terms of the Rademacher complexities of the corresponding families of kernel maps. In a series of experiments, we demonstrate the improved performance of our algorithm compared to baselines. Furthermore, the algorithm enjoys several favorable properties: the optimization problem is convex, it allows for learning with non-PDS kernels, and its solutions are highly sparse, resulting in improved classification speed and reduced memory requirements.
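To make the shape of such an objective concrete, the sketch below shows one plausible reading of the abstract, not the authors' exact formulation: a convex hinge-loss problem over a voted combination of polynomial-kernel predictors, where each coefficient block carries an L1 penalty weighted by a proxy for the complexity of its kernel family. The kernel degrees, the sqrt(degree) complexity proxy, and the hyperparameters lam and beta are all illustrative assumptions; the paper derives the actual weights from its Rademacher complexity bounds.

    # Minimal sketch (assumptions noted above), using cvxpy for the convex program.
    import numpy as np
    import cvxpy as cp

    def poly_kernel(X1, X2, degree):
        """Polynomial kernel (x . x' + 1)^degree."""
        return (X1 @ X2.T + 1.0) ** degree

    def voted_kernel_regularization(X, y, degrees=(1, 2, 3, 4),
                                    lam=1e-2, beta=1e-3):
        """X: (m, d) training inputs; y: (m,) labels in {-1, +1}."""
        m = X.shape[0]
        # One block of coefficients per kernel family, one coefficient
        # per training point.
        alphas = [cp.Variable(m) for _ in degrees]
        b = cp.Variable()

        # Score of the combined (voted) predictor on the training points.
        scores = sum(poly_kernel(X, X, d) @ a
                     for d, a in zip(degrees, alphas)) + b
        hinge = cp.sum(cp.pos(1 - cp.multiply(y, scores))) / m

        # Complexity-weighted L1 penalty: richer kernel families (higher
        # degree, hence larger Rademacher complexity) pay a larger
        # per-coefficient price. sqrt(d) is a crude stand-in proxy here.
        penalty = sum((lam * np.sqrt(d) + beta) * cp.norm1(a)
                      for d, a in zip(degrees, alphas))

        problem = cp.Problem(cp.Minimize(hinge + penalty))
        problem.solve()
        return [a.value for a in alphas], b.value

Under this reading, a new point x is classified by the sign of sum_k sum_j alpha_{k,j} K_k(x_j, x) + b; because the L1 penalty drives most coefficients to zero, only a small set of support points and kernels must be retained, which is consistent with the sparsity, speed, and memory claims above.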
