Discriminative Feature Selection via Employing Smooth and Robust Hinge Loss

A wide variety of sparsity-inducing feature selection methods have been developed in recent years. Most of them build their loss functions on regression, which is general and easy to optimize but not well suited to classification. In contrast, the hinge loss (HL) of support vector machines has proved powerful for classification tasks, yet a model that combines existing multiclass HL with sparsity regularization is difficult to optimize. In view of this, we propose a new loss, called the smooth and robust HL, which combines the merits of regression and HL while overcoming their drawbacks, and apply it to our sparsity-regularized feature selection model. To optimize the model, we present a new variant of the accelerated proximal gradient (APG) algorithm, which boosts the discriminative margins among different classes compared with standard APG algorithms. We further propose an efficient optimization technique to solve the proximal projection problem at each iteration, a key component of the new APG algorithm. We theoretically prove that the new APG algorithm converges at rate $O(1/k^{2})$ in the convex case ($k$ is the iteration counter), which is the optimal convergence rate for smooth problems. Experimental results on nine publicly available data sets demonstrate the effectiveness of our method.
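
To make the overall setup concrete, the sketch below shows a generic sparsity-regularized feature selection model optimized by a FISTA-style APG loop. It is only an illustration under stated assumptions, not the authors' exact method: a standard Huberized (smoothed) hinge stands in for the proposed smooth and robust HL, an $\ell_{2,1}$-norm regularizer supplies row sparsity, and the margin-boosting modification of the paper's APG variant is not reproduced. All function names and parameters (`smoothed_hinge`, `prox_l21`, `apg_feature_selection`, `delta`, `lam`) are illustrative.

```python
import numpy as np

def smoothed_hinge(z, delta=0.5):
    """Huberized (smoothed) hinge loss and its derivative w.r.t. the margin z.
    Quadratic near the hinge point, so the gradient is Lipschitz continuous."""
    loss = np.where(z >= 1.0, 0.0,
           np.where(z <= 1.0 - delta, 1.0 - z - delta / 2.0,
                    (1.0 - z) ** 2 / (2.0 * delta)))
    grad = np.where(z >= 1.0, 0.0,
           np.where(z <= 1.0 - delta, -1.0, -(1.0 - z) / delta))
    return loss, grad

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: row-wise soft-thresholding."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * W

def apg_feature_selection(X, Y, lam=0.1, L=None, n_iter=200, delta=0.5):
    """FISTA-style accelerated proximal gradient for
        min_W  (1/n) * sum smoothed_hinge(Y * (X @ W)) + lam * ||W||_{2,1}
    X: (n, d) data matrix, Y: (n, c) labels in {-1, +1} (one column per class)."""
    n, d = X.shape
    c = Y.shape[1]
    if L is None:
        # Crude Lipschitz estimate of the smooth part's gradient
        L = np.linalg.norm(X, 2) ** 2 / (n * delta)
    W = np.zeros((d, c))
    V, t = W.copy(), 1.0
    for _ in range(n_iter):
        margins = Y * (X @ V)
        _, g = smoothed_hinge(margins, delta)
        grad = X.T @ (Y * g) / n                    # gradient of the smooth loss at V
        W_new = prox_l21(V - grad / L, lam / L)     # proximal (projection) step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        V = W_new + ((t - 1.0) / t_new) * (W_new - W)  # Nesterov extrapolation
        W, t = W_new, t_new
    return W  # rank features by the row norms of W
```

In this convex setting the FISTA iteration attains the $O(1/k^{2})$ rate quoted above; the smoothing parameter `delta` trades off fidelity to the original hinge against the Lipschitz constant (and hence the step size) of the smooth part.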
