Genetic algorithm based cancerous gene identification from microarray data using ensemble of filter methods

Abstract: Microarray datasets play a crucial role in cancer detection, but their high dimensionality makes classification challenging because many of the features are irrelevant or redundant. Feature selection is therefore indispensable in this field, as it removes the unneeded features. Because selecting the optimal set of features is an NP-hard problem, meta-heuristic search techniques are used to cope with it. In this paper, we propose a two-stage model for feature selection in microarray datasets. The gene rankings produced by different filter methods are quite diverse, and the effectiveness of each ranking is dataset dependent. In the first stage, we therefore develop an ensemble of filter methods by taking the union and the intersection of the top-n features of ReliefF, chi-square, and symmetrical uncertainty. This ensemble combines the information of the three rankings in a single subset. In the second stage, we apply a genetic algorithm (GA) to the union and to the intersection to fine-tune the results; the union performs better than the intersection. We show that our model is classifier independent by using three classifiers: multi-layer perceptron (MLP), support vector machine (SVM), and K-nearest neighbor (K-NN). We have tested our model on five cancer datasets: colon, lung, leukemia, SRBCT, and prostate. Experimental results illustrate the superiority of our model in comparison to state-of-the-art methods.
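The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The gene names, filter scores, GA settings, and the toy fitness function below are all hypothetical; in practice the fitness would more plausibly be a classifier's validation accuracy, and the scores would come from real ReliefF, chi-square, and symmetrical uncertainty computations.

```python
import random

def top_n(scores, n):
    """Return the n feature names with the highest filter scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])

def ensemble(rankings, n):
    """Stage 1: union and intersection of the top-n features of each filter."""
    tops = [top_n(scores, n) for scores in rankings]
    return set.union(*tops), set.intersection(*tops)

def run_ga(features, fitness, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Stage 2: a small GA over bit masks of the candidate features.

    Uses truncation selection, uniform crossover, and bit-flip mutation.
    These operator choices are assumptions made for illustration.
    """
    rng = random.Random(seed)
    features = sorted(features)

    def decode(mask):
        return [f for f, bit in zip(features, mask) if bit]

    def score(mask):
        return fitness(decode(mask))

    pop = [[rng.randint(0, 1) for _ in features] for _ in range(pop_size)]
    best = max(pop, key=score)
    for _ in range(generations):
        # Keep the fitter half of the population as parents.
        parents = sorted(pop, key=score, reverse=True)[:pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each bit comes from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Bit-flip mutation with probability p_mut per bit.
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
        candidate = max(pop, key=score)
        if score(candidate) > score(best):
            best = candidate
    return decode(best)

# Hypothetical filter scores for six genes (higher means more relevant).
relieff = {"g1": 0.90, "g2": 0.80, "g3": 0.10, "g4": 0.70, "g5": 0.20, "g6": 0.05}
chi_sq = {"g1": 0.85, "g2": 0.20, "g3": 0.75, "g4": 0.80, "g5": 0.10, "g6": 0.30}
sym_unc = {"g1": 0.95, "g2": 0.30, "g3": 0.20, "g4": 0.90, "g5": 0.85, "g6": 0.10}

union_set, inter_set = ensemble([relieff, chi_sq, sym_unc], n=3)

# Toy fitness (an assumption): reward "informative" genes, penalise subset size.
INFORMATIVE = {"g1", "g4", "g5"}

def toy_fitness(subset):
    return len(INFORMATIVE & set(subset)) - 0.1 * len(subset)

selected = run_ga(union_set, toy_fitness)
print("union:", sorted(union_set))
print("intersection:", sorted(inter_set))
print("GA-selected subset of the union:", selected)
```

With these toy scores the union holds five genes and the intersection two, which shows why the union gives the GA a larger search space to refine.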
