$\ell_{p}-\ell_{q}$ Penalty for Sparse Linear and Sparse Multiple Kernel Multitask Learning

Recently, there has been much interest in the multitask learning (MTL) problem under the constraint that tasks share a common sparsity profile. Such a problem can be addressed through a regularization framework in which the regularizer induces a joint-sparsity pattern across task decision functions. We follow this principled framework and focus on $\ell_p$-$\ell_q$ (with $0 \le p \le 1$ and $1 \le q \le 2$) mixed norms as sparsity-inducing penalties. Our motivation for addressing such a large class of penalties is the ability to adapt the penalty to the problem at hand, leading to better performance and a better sparsity pattern. For solving the problem in the general multiple kernel case, we first derive a variational formulation of the $\ell_1$-$\ell_q$ penalty, from which we propose an alternating optimization algorithm. Although very simple, this algorithm provably converges to the global minimum of the $\ell_1$-$\ell_q$ penalized problem. For the linear case, we extend existing work on accelerated proximal gradient methods to this penalty. Our contribution in this context is an efficient scheme for computing the $\ell_1$-$\ell_q$ proximal operator. Then, for the more general case, when $0 < p < 1$, we solve the resulting nonconvex problem through a majorization-minimization approach. The resulting algorithm is an iterative scheme which, at each iteration, solves a weighted $\ell_1$-$\ell_q$ sparse MTL problem. Empirical evidence from a toy dataset and from real-world datasets dealing with brain-computer interface single-trial electroencephalogram classification and protein subcellular localization shows the benefit of the proposed approaches and algorithms.
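To make the two algorithmic ingredients above concrete, the sketch below illustrates the best-known special case, $q = 2$: the proximal operator of the $\ell_1$-$\ell_2$ (group-lasso) penalty is block soft-thresholding, and the majorization-minimization step for $0 < p < 1$ reduces to reweighting each group's penalty by $p\,\|w_g\|^{p-1}$. This is a generic illustration of these standard operations, not the paper's own implementation; the function names and the small smoothing constant `eps` are our choices.

```python
import numpy as np

def prox_l1_l2(W, lam):
    """Proximal operator of lam * sum_g ||W[g]||_2 (l1-l2 penalty, q = 2).

    Each row of W is one group (e.g., the coefficients of one feature
    across all tasks). The prox is block soft-thresholding: each row's
    norm is shrunk by lam, and rows whose norm is below lam are zeroed,
    which produces the joint-sparsity pattern across tasks.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return scale * W

def mm_group_weights(W, p, eps=1e-8):
    """Per-group weights for one majorization-minimization step, 0 < p < 1.

    Majorizing ||w_g||^p by its tangent at the current iterate gives the
    weight p * ||w_g||^(p-1) for each group; eps guards against division
    by zero for groups that are currently exactly zero.
    """
    norms = np.linalg.norm(W, axis=1)
    return p * (norms + eps) ** (p - 1)

W = np.array([[3.0, 4.0],    # group norm 5: kept, shrunk toward zero
              [0.1, 0.0]])   # group norm 0.1 < lam: zeroed out
print(prox_l1_l2(W, lam=1.0))   # row 0 scaled by 1 - 1/5 = 0.8, row 1 -> 0
```

At each outer MM iteration one would recompute `mm_group_weights` at the current iterate and solve the resulting weighted $\ell_1$-$\ell_2$ problem, e.g. by proximal gradient steps using a per-group-scaled version of `prox_l1_l2`.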
