Regularization and feature selection for large dimensional data

Feature selection has become an important step in many machine learning pipelines. In domains such as bioinformatics and text classification, which involve high-dimensional data, feature selection can drastically reduce the feature space. Where it is difficult or infeasible to obtain a sufficient number of training examples, feature selection helps overcome the curse of dimensionality, which in turn improves the performance of the classification algorithm. Our research focuses on five embedded feature selection methods that use ridge regression (an L2 penalty), Lasso regression (an L1 penalty), or a combination of the two as the regularization term of the optimization function. We evaluate the five methods on five high-dimensional datasets and compare them with respect to the sparsity and correlation present in the datasets and with respect to their execution times.
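
The difference between the three kinds of penalty can be illustrated with a minimal sketch. It is not one of the five methods evaluated here. It assumes the simplest possible setting, a single weight w fitted to an unregularized estimate z by minimizing 0.5*(w - z)^2 plus a penalty, for which the minimizers are known in closed form. The function names, the example values in `z`, and the strengths `lam1` and `lam2` are all hypothetical choices made for illustration.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w) (L1 / Lasso penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small estimates are set exactly to zero


def ridge_shrink(z, lam2):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam2*w**2 (L2 / ridge penalty)."""
    return z / (1.0 + lam2)


def elastic_net_shrink(z, lam1, lam2):
    """Minimizer of 0.5*(w - z)**2 + lam1*abs(w) + 0.5*lam2*w**2."""
    return soft_threshold(z, lam1) / (1.0 + lam2)


def selected_features(weights, tol=1e-12):
    """Indices of features whose weight is not (numerically) zero."""
    return [i for i, w in enumerate(weights) if abs(w) > tol]


# Hypothetical unregularized estimates for five features.
z = [2.0, -0.3, 0.05, -1.5, 0.4]
lam1, lam2 = 0.5, 1.0  # assumed penalty strengths

ridge = [ridge_shrink(v, lam2) for v in z]
lasso = [soft_threshold(v, lam1) for v in z]
enet = [elastic_net_shrink(v, lam1, lam2) for v in z]

print("ridge keeps      ", selected_features(ridge))
print("lasso keeps      ", selected_features(lasso))
print("elastic net keeps", selected_features(enet))
```

In this toy setting the ridge penalty shrinks every weight but keeps all five features, while the Lasso and combined penalties set the three smallest estimates exactly to zero. That exact zeroing is what lets an L1-based regularizer act as an embedded feature selector.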
