Evaluating feature selection strategies for high dimensional, small sample size datasets

In this work, we analyze and evaluate different strategies for comparing Feature Selection (FS) schemes on High Dimensional (HD) biomedical datasets (e.g. gene and protein expression studies) with a small sample size (SSS). Additionally, we define a new measure, Robustness, specifically for comparing the ability of an FS scheme to be invariant to changes in its training data. Classifier accuracy has been the de facto method for evaluating FS schemes, but on account of the curse of dimensionality it might not always be the appropriate measure for HD/SSS datasets. SSS gives a dataset a higher probability of containing data that are not representative of the true distribution of the whole population. An ideal FS scheme must therefore be robust enough to produce the same results when the training data change. In this study, we employed the robustness performance measure in conjunction with classifier accuracy (measured via the K-Nearest Neighbor and Random Forest classifiers) to quantitatively compare five different FS schemes (T-test, F-test, Kolmogorov-Smirnov Test, Wilks' Lambda Test and Wilcoxon Rank Sum Test) on five HD/SSS gene and protein expression datasets corresponding to ovarian cancer, lung cancer, bone lesions, celiac disease, and coronary heart disease. Of the five FS schemes compared, the Wilcoxon Rank Sum Test was found to outperform the other FS schemes in terms of classification accuracy and robustness. Our results suggest that both classifier accuracy and robustness should be considered when deciding on the appropriate FS scheme for HD/SSS datasets.
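The abstract does not give the formula for Robustness, so the following is only a minimal sketch of the general idea, under stated assumptions: features are ranked by an absolute Welch t-statistic (a stand-in for any of the five FS schemes), the top k are selected on repeated stratified subsamples of the training data, and stability is summarized as the mean pairwise Jaccard overlap of the selected sets. The function names (`t_scores`, `top_k`, `robustness`), the Jaccard summary, and the parameters `n_rounds` and `frac` are illustrative choices, not the authors' definition.

```python
import random
from itertools import combinations
from statistics import mean, variance


def t_scores(X, y):
    """Absolute Welch t-statistic per feature for a two-class problem (labels 0/1)."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
        scores.append(abs(mean(a) - mean(b)) / se if se > 0 else 0.0)
    return scores


def top_k(X, y, k):
    """Indices of the k highest-scoring features."""
    scores = t_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def robustness(X, y, k, n_rounds=20, frac=0.8, seed=0):
    """Mean pairwise Jaccard overlap of top-k feature sets across subsamples.

    Assumption: this overlap score is an illustrative stability measure,
    not the Robustness measure defined in the paper. Each class needs at
    least two samples so that a variance can be computed.
    """
    rng = random.Random(seed)
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    selected = []
    for _ in range(n_rounds):
        # Stratified subsample without replacement, at least 2 per class.
        pick = (rng.sample(idx0, max(2, int(frac * len(idx0))))
                + rng.sample(idx1, max(2, int(frac * len(idx1)))))
        selected.append(top_k([X[i] for i in pick], [y[i] for i in pick], k))
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(selected, 2)]
    return mean(overlaps)
```

A score near 1 means the scheme picks nearly the same features regardless of which samples it sees; a score near 0 means the selection is driven by the particular subsample. Under this sketch, such a score would be reported alongside classifier accuracy rather than in place of it.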
