Feature selection model based on clustering and ranking in pipeline for microarray data

Abstract Most feature selection techniques in the literature are classifier bound: the selected group of features is tied to the performance of a specific classifier, as in the wrapper and hybrid approaches. Our objective in this study is to select a set of generic features that is not tied to any classifier, using the proposed framework. The framework applies attribute clustering and feature ranking techniques in a pipeline in order to remove redundant features. On each cluster uncovered, signal-to-noise ratio, t-statistics and significance analysis of microarrays are applied independently to select the top-ranked features. Both filter and evolutionary wrapper approaches have been considered for feature selection, and the data set restricted to the selected features is given to an ensemble of predefined, statistically different classifiers. The class labels of the test data are determined by majority voting. In addition to these objectives, the paper focuses on obtaining a stable result across the various classification models. Further, a comparative analysis has been performed to study the classification accuracy and computational time of the current approach against evolutionary wrapper techniques. The proposed approach gives better insight into the features and further improves classification accuracy with less computational time.
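The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a two-class problem with labels 0 and 1, uses plain k-means over the feature columns as a stand-in for the attribute clustering step, and scores features with one common form of the signal-to-noise ratio, |mean_0 - mean_1| / (std_0 + std_1). All function names and parameters (`cluster_features`, `select_top_per_cluster`, `majority_vote`, `top`, `k`) are hypothetical.

```python
import numpy as np

def snr_scores(X, y):
    """Signal-to-noise ratio per feature (assumed form, two classes 0/1)."""
    a, b = X[y == 0], X[y == 1]
    eps = 1e-12  # avoids division by zero for constant features
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + eps)

def cluster_features(X, k, n_iter=20, seed=0):
    """Group feature columns with plain k-means (stand-in for attribute clustering)."""
    rng = np.random.default_rng(seed)
    F = X.T  # one row per feature
    centers = F[rng.choice(F.shape[0], size=k, replace=False)]
    labels = np.zeros(F.shape[0], dtype=int)
    for _ in range(n_iter):
        dist = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # leave an empty cluster's center unchanged
                centers[c] = F[labels == c].mean(axis=0)
    return labels

def select_top_per_cluster(X, y, labels, top=1, score_fn=snr_scores):
    """Keep the `top` highest-scoring features inside each feature cluster."""
    scores = score_fn(X, y)
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        best = idx[np.argsort(scores[idx])[::-1][:top]]
        selected.extend(best.tolist())
    return sorted(selected)

def majority_vote(predictions):
    """Combine integer class labels of shape (n_classifiers, n_samples)."""
    predictions = np.asarray(predictions)
    return np.array([np.bincount(col).argmax() for col in predictions.T])

# Toy data: 6 samples, 4 features; features 1 and 2 separate the two classes.
X = np.array([[0, 0.0, 5.0, 1], [1, 0.1, 5.1, 0], [0, 0.0, 4.9, 1],
              [1, 10.0, 0.0, 0], [0, 10.1, 0.1, 1], [1, 9.9, 0.0, 0]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
feature_clusters = np.array([0, 0, 1, 1])  # fixed here; cluster_features(X, k) would produce these
chosen = select_top_per_cluster(X, y, feature_clusters, top=1)
votes = majority_vote([[0, 1, 1], [0, 0, 1], [1, 1, 1]])
```

Because the ranking function is passed in as `score_fn`, a t-statistic or SAM-style score could be substituted without changing the rest of the sketch. Selecting within clusters rather than globally is what removes redundancy: highly similar features fall into the same cluster and compete for the same slots.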
