Analysis of feature selection stability on high dimension and small sample data

Feature selection is an important step when building a classifier on high-dimensional data. When the number of observations is small, feature selection tends to be unstable: two feature subsets obtained from different datasets that address the same classification problem often do not overlap significantly. Although this is a crucial problem, little work has been done on selection stability. The behavior of feature selection is analyzed under various conditions, with a focus on, but not limited to, t-score based feature selection approaches and small-sample data. The analysis proceeds in three steps: the first is theoretical and uses a simple mathematical model; the second is empirical and based on artificial data; and the third is based on real data. The three analyses lead to the same results and give a better understanding of the feature selection problem in high-dimensional data.
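
The instability described above can be illustrated with a minimal sketch on artificial data. It is not the model or protocol used in the paper; every name and parameter here (`make_data`, `selection_overlap`, the Gaussian features, the mean shift of 1.0, the 1000 features of which 20 are informative, the Jaccard index as overlap measure) is an assumption chosen for illustration. Two datasets are drawn from the same two-class problem, the top-k features are selected on each by absolute Welch t-score, and the overlap of the two subsets is measured.

```python
import numpy as np


def t_scores(X, y):
    """Absolute Welch t-score of each feature for a two-class problem."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / se


def top_k(X, y, k):
    """Indices of the k features with the largest absolute t-score."""
    return set(np.argsort(t_scores(X, y))[-k:].tolist())


def jaccard(s1, s2):
    """Overlap of two feature subsets: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)


def make_data(rng, n_per_class, n_features, n_informative, shift):
    """Hypothetical generator: Gaussian noise, with the first
    n_informative features shifted by `shift` in class 1."""
    X = rng.normal(size=(2 * n_per_class, n_features))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, :n_informative] += shift
    return X, y


def selection_overlap(n_per_class, seed=0, n_features=1000,
                      n_informative=20, shift=1.0, k=20):
    """Overlap between the feature subsets selected on two independent
    datasets drawn from the same classification problem."""
    rng = np.random.default_rng(seed)
    X1, y1 = make_data(rng, n_per_class, n_features, n_informative, shift)
    X2, y2 = make_data(rng, n_per_class, n_features, n_informative, shift)
    return jaccard(top_k(X1, y1, k), top_k(X2, y2, k))


if __name__ == "__main__":
    for n in (10, 50, 500):
        print(f"n per class = {n:4d}  overlap = {selection_overlap(n):.2f}")
```

Under these assumptions, the overlap is expected to be low when only a few observations per class are available and to increase as the sample size grows, which is the qualitative effect the abstract refers to.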
