Theoretical and empirical analysis of filter ranking methods: Experimental study on benchmark DNA microarray data

Abstract: DNA microarray experiments generate thousands of gene expression values that provide information about the state of cells and tissues. Although these expression values are useful in disease classification, only a few genes contribute to that classification. Feature selection algorithms are therefore beneficial here, since their main goal is to identify the relevant features (here, genes) efficiently. Many feature selection algorithms proposed in the recent literature measure the relevance and redundancy of features using various evaluation criteria. An important type of feature selection technique is feature ranking, which does not use any learning algorithm but instead assigns an importance value or weight to each feature. In this paper, we provide an extensive study of 10 widely used filter ranking methods. We apply the methods to 10 microarray datasets (both binary-class and multi-class) and test the resulting accuracies using three well-known classifiers, namely Multi-layer Perceptron (MLP), Support Vector Machine (SVM) and K-Nearest Neighbor (KNN). We conduct a wide variety of tests to assess the strengths and weaknesses of the filter methods. The study provides a comparison among the filter methods that helps researchers make an informed choice of filter method for their work. Three categories of filter methods are tested, namely Entropy based, Similarity based and Statistics based. The experiments show that, of all the methods, Mutual Information (MI) gives the best results, which also makes it the best of the Entropy based methods. ReliefF performs best among the Similarity based methods, and Chi-square performs best among the Statistics based methods. For bi-class datasets, Chi-square is the better choice, while for multi-class datasets MI gives better results.
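As a rough illustration of what filter ranking means in practice, the sketch below scores each feature against the class labels with a plug-in estimate of mutual information and sorts the features by that score, without involving any classifier. It is a minimal sketch and not the authors' implementation: it assumes expression values have already been discretized, the names `mutual_information` and `rank_features` are hypothetical, and the toy data are made up for illustration.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Plug-in estimate of I(X; Y) in bits for two equal-length discrete sequences."""
    n = len(labels)
    count_x = Counter(feature)
    count_y = Counter(labels)
    count_xy = Counter(zip(feature, labels))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(rows, labels, score=mutual_information):
    """Return (feature_index, score) pairs sorted from most to least relevant."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((j, score(column, labels)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): 6 samples, 3 discretized "genes".
# Gene 0 tracks the class exactly, gene 1 is constant,
# gene 2 is only partly informative.
samples = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
]
classes = [0, 0, 0, 1, 1, 1]

ranking = rank_features(samples, classes)
# Keep the two highest-ranked genes; a classifier would then be
# trained on these columns only.
top_k = [index for index, _ in ranking[:2]]
```

Because the scoring function is passed as a parameter, another filter criterion (for example a chi-square statistic) could be swapped in without changing the ranking step, which is the sense in which such methods are independent of the downstream classifier.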
