Efficient Feature Selection via $\ell_{2,0}$-Norm Constrained Sparse Regression

Sparse-regression-based feature selection has been extensively investigated in recent years. However, because the $\ell_{2,0}$-norm constraint is non-convex, the problem is very hard to solve. In this paper, unlike most existing methods, which only solve a relaxed version by forcing a sparsity regularizer into the objective function, we propose a novel framework that solves the original $\ell_{2,0}$-norm constrained sparse regression feature selection problem. Using a new label coding method, we transform the objective function into Linear Discriminant Analysis (LDA), enabling our model to compute each feature's ratio of inter-class scatter to intra-class scatter, the most widely used metric of feature discriminative power. According to this ratio, features can be selected by a simple sorting procedure. A projected gradient descent algorithm, initialized with the sorting-based solution, is then introduced to further improve performance; this initialization also ensures the stability of the iterative algorithm. We prove that the proposed method attains the globally optimal solution of this non-convex problem when all features are statistically independent. For the general case, where features are statistically dependent, extensive experiments on six small-sample-size datasets and one large-scale dataset show that, with an SVM classifier, our algorithm achieves classification performance comparable to or better than that of eight state-of-the-art feature selection methods. We also show that our algorithm attains a low loss value, meaning its solution comes very close to the true solution of this NP-hard problem. Moreover, because we solve the original $\ell_{2,0}$-norm constrained problem, we avoid the heavy work of tuning a regularization parameter: in our method its meaning is explicit, namely the number of selected features. Finally, we experimentally evaluate the stability of our algorithm from two perspectives, the objective function values and the selected features, and it shows satisfactory stability from both.
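For concreteness, a standard formulation of the constrained problem the abstract refers to (the paper's exact notation may differ) is

$$\min_{W \in \mathbb{R}^{d \times c}} \left\| X^{\top} W - Y \right\|_F^2 \quad \text{s.t.} \quad \| W \|_{2,0} = k,$$

where $X \in \mathbb{R}^{d \times n}$ is the data matrix, $Y \in \mathbb{R}^{n \times c}$ encodes the class labels, $\|W\|_{2,0}$ counts the nonzero rows of $W$ (each row corresponding to a feature), and $k$ is the number of features to select.

The sorting step and the projection used by projected gradient descent can be illustrated with a minimal sketch. This is hypothetical code, not the authors' implementation; the names `fisher_ratios`, `select_top_k`, and `project_l20` are illustrative, and the scatter ratio shown is the standard per-feature Fisher score.

```python
import numpy as np

def fisher_ratios(X, y):
    """Per-feature ratio of inter-class to intra-class scatter.

    X : (n_samples, n_features) data matrix
    y : (n_samples,) integer class labels
    """
    overall_mean = X.mean(axis=0)
    s_between = np.zeros(X.shape[1])
    s_within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        s_between += len(Xc) * (mu_c - overall_mean) ** 2   # inter-class scatter
        s_within += ((Xc - mu_c) ** 2).sum(axis=0)          # intra-class scatter
    return s_between / (s_within + 1e-12)  # guard against zero within-class scatter

def select_top_k(X, y, k):
    """Rank features by scatter ratio and keep the k best (the 'simple sorting' step)."""
    return np.argsort(fisher_ratios(X, y))[::-1][:k]

def project_l20(W, k):
    """Project W onto {W : ||W||_{2,0} <= k}: keep the k rows with the
    largest l2-norm and zero out the rest (the projection step of PGD)."""
    row_norms = np.linalg.norm(W, axis=1)
    keep = np.argsort(row_norms)[::-1][:k]
    P = np.zeros_like(W)
    P[keep] = W[keep]
    return P
```

For example, `select_top_k(X, y, k=50)` returns the indices of the 50 highest-ranked features; in the framework the abstract describes, such a ranking-based solution would serve as the initialization that projected gradient descent then refines.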
