Double Sparsity Kernel Learning with Automatic Variable Selection and Data Extraction

Learning in a Reproducing Kernel Hilbert Space (RKHS) has been widely used in many scientific disciplines. Because an RKHS can be very flexible, it is common to impose a regularization term in the optimization to prevent overfitting. Standard RKHS learning employs the squared norm penalty on the learning function. Despite its success, many challenges remain. In particular, the squared norm penalty cannot be used directly for variable selection or data extraction. Consequently, when noise predictors are present, or when the underlying function has a sparse representation in the dual space, the performance of standard RKHS learning can be suboptimal. In the literature, methods have been proposed for variable selection in RKHS learning, and a data sparsity constraint has been considered for data extraction. However, how to learn in an RKHS with variable selection and data extraction simultaneously remains unclear. In this paper, we propose a unified RKHS learning method, namely DOuble Sparsity Kernel (DOSK) learning, to overcome this challenge. An efficient algorithm is provided to solve the corresponding optimization problem. We prove that, under certain conditions, our new method asymptotically achieves variable selection consistency. Simulated and real data results demonstrate that DOSK is highly competitive among existing approaches for RKHS learning.
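To make the setup concrete, standard RKHS learning minimizes a penalized empirical risk, and the representer theorem reduces the search to a finite kernel expansion. The display below is a textbook formulation rather than this paper's exact notation, and the double-sparsity line is only a hedged sketch of how predictor weights w and dual coefficients alpha might each carry a sparsity-inducing penalty.

% Standard RKHS learning with the squared-norm penalty (textbook form);
% by the representer theorem the minimizer is a finite kernel expansion.
\min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) + \lambda \|f\|_{\mathcal{H}_K}^2,
\qquad f(\cdot) = \sum_{j=1}^{n} \alpha_j K(\cdot, x_j).

% Hedged sketch of a double-sparsity objective (notation assumed, not the
% paper's): an \ell_1 penalty on the predictor weights w drives variable
% selection, and an \ell_1 penalty on the dual coefficients \alpha drives
% data extraction, on top of the usual quadratic RKHS-norm term.
\min_{w \ge 0,\, \alpha} \; \frac{1}{n} \sum_{i=1}^{n} L\Bigl(y_i, \textstyle\sum_{j=1}^{n} \alpha_j K_w(x_i, x_j)\Bigr)
+ \lambda_1 \|w\|_1 + \lambda_2 \|\alpha\|_1 + \lambda_3\, \alpha^{\top} K_w \alpha,

where K_w denotes the kernel applied to weighted predictors, e.g. K_w(x, x') = K(w \odot x, w \odot x'); zeroed entries of w remove variables, and zeroed entries of \alpha remove data points from the fitted expansion.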
