A Proximal Approach for Sparse Multiclass SVM

Sparsity-inducing penalties are useful tools to design multiclass support vector machines (SVMs). In this paper, we propose a convex optimization approach for efficiently and exactly solving the multiclass SVM learning problem involving a sparse regularization and the multiclass hinge loss formulated by Crammer and Singer. We provide two algorithms: the first one dealing with the hinge loss as a penalty term, and the other one addressing the case when the hinge loss is enforced through a constraint. The related convex optimization problems can be efficiently solved thanks to the flexibility offered by recent primal-dual proximal algorithms and epigraphical splitting techniques. Experiments carried out on several datasets demonstrate the interest of considering the exact expression of the hinge loss rather than a smooth approximation. The efficiency of the proposed algorithms w.r.t. several state-of-the-art methods is also assessed through comparisons of execution times.

[1]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[2]  M. Aizerman,et al.  Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning , 1964 .

[3]  Robert Tibshirani,et al.  1-norm Support Vector Machines , 2003, NIPS.

[4]  Xiaotong Shen,et al.  On L1-Norm Multiclass Support Vector Machines , 2007 .

[5]  Laurent Condat Fast projection onto the simplex and the l1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pmb {l}_\mathbf {1}$$\end{ , 2015, Mathematical Programming.

[6]  Yann LeCun,et al.  Large-scale Learning with SVM and Convolutional for Generic Object Categorization , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[7]  M. Yuan,et al.  Model selection and estimation in regression with grouped variables , 2006 .

[8]  Lifeng Wang,et al.  On L_1-Norm Multi-class Support Vector Machines , 2006, 2006 5th International Conference on Machine Learning and Applications (ICMLA'06).

[9]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[10]  Laura Schweitzer,et al.  Advances In Kernel Methods Support Vector Learning , 2016 .

[11]  Ji Zhu,et al.  Variable selection for multicategory SVM via sup-norm regularization , 2006 .

[12]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[13]  Josiane Mothe,et al.  Nonconvex Regularizations for Feature Selection in Ranking With Sparse SVM , 2014, IEEE Transactions on Neural Networks and Learning Systems.

[14]  Bang Công Vu,et al.  A splitting algorithm for dual monotone inclusions involving cocoercive operators , 2011, Advances in Computational Mathematics.

[15]  H. Zou,et al.  The doubly regularized support vector machine , 2006 .

[16]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[17]  Stephen P. Boyd,et al.  Proximal Algorithms , 2013, Found. Trends Optim..

[18]  Yufeng Liu,et al.  Variable Selection via A Combination of the L0 and L1 Penalties , 2007 .

[19]  Lorenzo Rosasco,et al.  Proximal methods for the latent group lasso penalty , 2012, Computational Optimization and Applications.

[20]  Laurent Condat,et al.  A Fast Projection onto the Simplex and the l 1 Ball , 2015 .

[21]  Ben Taskar,et al.  Joint covariate selection and joint subspace selection for multiple classification problems , 2010, Stat. Comput..

[22]  Stéphane Canu,et al.  $\ell_{p}-\ell_{q}$ Penalty for Sparse Linear and Sparse Multiple Kernel Multitask Learning , 2011, IEEE Transactions on Neural Networks.

[23]  Julien Mairal,et al.  Optimization with Sparsity-Inducing Penalties , 2011, Found. Trends Mach. Learn..

[24]  Lawrence Carin,et al.  Sparse multinomial logistic regression: fast algorithms and generalization bounds , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  P. Bühlmann,et al.  The group lasso for logistic regression , 2008 .

[26]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[27]  Laurent Condat,et al.  A Primal–Dual Splitting Method for Convex Optimization Involving Lipschitzian, Proximable and Linear Composite Terms , 2012, Journal of Optimization Theory and Applications.

[28]  Lorenzo Rosasco,et al.  Nonparametric sparsity and regularization , 2012, J. Mach. Learn. Res..

[29]  Ryan M. Rifkin,et al.  In Defense of One-Vs-All Classification , 2004, J. Mach. Learn. Res..

[30]  Giovanni Chierchia,et al.  Parallel implementations of a disparity estimation algorithm based on a Proximal splitting method , 2012, 2012 Visual Communications and Image Processing.

[31]  Alain Rakotomamonjy,et al.  Automatic Feature Learning for Spatio-Spectral Image Classification With Sparse SVM , 2014, IEEE Transactions on Geoscience and Remote Sensing.

[32]  Yoram Singer,et al.  Boosting with structural sparsity , 2009, ICML '09.

[33]  Stéphane Mallat,et al.  Invariant Scattering Convolution Networks , 2012, IEEE transactions on pattern analysis and machine intelligence.

[34]  J. Moreau Proximité et dualité dans un espace hilbertien , 1965 .

[35]  Antonin Chambolle,et al.  A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging , 2011, Journal of Mathematical Imaging and Vision.

[36]  Michael P. Friedlander,et al.  Probing the Pareto Frontier for Basis Pursuit Solutions , 2008, SIAM J. Sci. Comput..

[37]  Yufeng Liu,et al.  Support vector machines with adaptive Lq penalty , 2007, Comput. Stat. Data Anal..

[38]  Bernhard Schölkopf,et al.  Use of the Zero-Norm with Linear Models and Kernel Methods , 2003, J. Mach. Learn. Res..

[39]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[40]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[41]  Nelly Pustelnik,et al.  Epigraphical Projection and Proximal Tools for Solving Constrained Convex Optimization Problems: Part I , 2012, ArXiv.

[42]  Carmen Peláez-Moreno,et al.  A Speech Recognizer Based on Multiclass SVMs with HMM-Guided Segmentation , 2005, NOLISP.

[43]  Kazuhiro Seki,et al.  Block coordinate descent algorithms for large-scale sparse multiclass classification , 2013, Machine Learning.

[44]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[45]  Patrick L. Combettes,et al.  A forward-backward view of some primal-dual optimization methods in image recovery , 2014, 2014 IEEE International Conference on Image Processing (ICIP).

[46]  Naonori Ueda,et al.  Large-Scale Multiclass Support Vector Machine Training via Euclidean Projection onto the Simplex , 2014, 2014 22nd International Conference on Pattern Recognition.

[47]  Thorsten Joachims,et al.  Cutting-plane training of structural SVMs , 2009, Machine Learning.

[48]  P. L. Combettes,et al.  Primal-Dual Splitting Algorithm for Solving Inclusions with Mixtures of Composite, Lipschitzian, and Parallel-Sum Type Monotone Operators , 2011, Set-Valued and Variational Analysis.

[49]  Paul S. Bradley,et al.  Feature Selection via Concave Minimization and Support Vector Machines , 1998, ICML.

[50]  Nelly Pustelnik,et al.  Epigraphical proximal projection for sparse multiclass SVM , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[51]  P. L. Combettes,et al.  Variable metric forward–backward splitting with applications to monotone inclusions in duality , 2012, 1206.6791.

[52]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[53]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Ken Lang,et al.  NewsWeeder: Learning to Filter Netnews , 1995, ICML.

[55]  Heinz H. Bauschke,et al.  Convex Analysis and Monotone Operator Theory in Hilbert Spaces , 2011, CMS Books in Mathematics.

[56]  Patrick L. Combettes,et al.  Proximal Splitting Methods in Signal Processing , 2009, Fixed-Point Algorithms for Inverse Problems in Science and Engineering.

[57]  Stephen P. Boyd,et al.  Enhancing Sparsity by Reweighted ℓ1 Minimization , 2007, 0711.1612.

[58]  Ivor W. Tsang,et al.  Learning Sparse SVM for Feature Selection on Very High Dimensional Datasets , 2010, ICML.

[59]  H. Zou,et al.  The F ∞ -norm support vector machine , 2008 .

[60]  Chih-Jen Lin,et al.  A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification , 2010, J. Mach. Learn. Res..

[61]  Thomas M. Cover,et al.  Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition , 1965, IEEE Trans. Electron. Comput..