Model Selection with the Covering Number of the Ball of RKHS

Model selection in kernel methods is the problem of choosing an appropriate hypothesis space for kernel-based learning algorithms so that the resulting hypothesis neither underfits nor overfits. One of the main difficulties in model selection is controlling the sample complexity when designing the selection criterion. In this paper, we take balls of reproducing kernel Hilbert spaces (RKHSs) as candidate hypothesis spaces and propose a novel model selection criterion that minimizes the empirical optimal error in the ball of an RKHS together with the covering number of the ball. By introducing the covering number to measure the capacity of the ball of an RKHS, our criterion can directly control the sample complexity. Specifically, we first prove a relation between the expected optimal error and the empirical optimal error in the ball of an RKHS. Using this relation as the theoretical foundation, we define our criterion. Then, by estimating the expectation of the empirical optimal error and proving an upper bound on the covering number, we represent the criterion as a functional of the kernel matrix. We further develop an efficient algorithm that evaluates this functional approximately, so that the fast Fourier transform (FFT) can be applied to achieve quasi-linear computational complexity, and we prove the consistency between the approximate criterion and the accurate one for sufficiently large samples. Finally, we empirically evaluate the performance of our criterion and verify the consistency between the approximate and accurate criteria.
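The criterion itself is not stated in closed form in this abstract, but the FFT-based evaluation it describes is in the spirit of circulant approximations of kernel matrices, whose eigenvalues can be obtained from a single FFT of the first row. The sketch below is a minimal illustration of that idea only, under our own assumptions: the Gaussian kernel, the regular 1-D grid, and the placeholder functional F(K) = sum_i log(1 + mu_i / lambda) are not the criterion derived in the paper.

```python
# Illustrative sketch only (not the paper's algorithm): evaluate a spectral
# functional of a kernel matrix in quasi-linear time by replacing the kernel
# matrix with a circulant approximation whose eigenvalues come from one FFT.
import numpy as np


def circulant_first_row(n, h, sigma):
    """First row of a symmetric circulant approximation of the Gaussian
    kernel matrix on a regular 1-D grid with spacing h (an assumption made
    for this sketch; the paper is not restricted to this setting)."""
    j = np.arange(n)
    d = np.minimum(j, n - j) * h          # wrap-around grid distances
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))


def circulant_eigenvalues(first_row):
    """Eigenvalues of the circulant matrix built from `first_row`.
    For a symmetric first row the FFT is real up to round-off error."""
    return np.fft.fft(first_row).real


def approximate_spectral_functional(n=1024, h=1.0 / 1024, sigma=0.1, lam=1e-2):
    """Placeholder functional F(K) = sum_i log(1 + mu_i / lam), evaluated on
    the circulant spectrum. The paper's criterion is a different functional
    of the kernel matrix; this one is purely for illustration."""
    mu = circulant_eigenvalues(circulant_first_row(n, h, sigma))
    mu = np.clip(mu, 0.0, None)           # guard against tiny negative round-off
    return float(np.sum(np.log1p(mu / lam)))


if __name__ == "__main__":
    # The functional shrinks as the kernel width sigma grows and the
    # effective capacity of the kernel matrix spectrum concentrates.
    for sigma in (0.02, 0.1, 0.5):
        print(f"sigma = {sigma:4.2f}  F = {approximate_spectral_functional(sigma=sigma):10.2f}")
```

Because the circulant spectrum is obtained from one length-n FFT, evaluating such a spectral functional costs O(n log n) rather than the O(n^3) of a full eigendecomposition of the kernel matrix, which is consistent with the quasi-linear complexity claimed above.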
