A new evolutionary gene selection technique

Microarray technology makes it possible to investigate gene expression levels by analyzing high-dimensional datasets with few samples. Selecting discriminative, differentially expressed genes from such datasets is important for differentiation, prognosis and understanding of the underlying biological processes. In this regard, the paper presents a new evolutionary gene selection method based on Student-t Stochastic Neighbor Embedding (t-SNE), Differential Evolution (DE) and Support Vector Machine (SVM). Here the classification task performed by the SVM serves as the optimization problem for DE, while t-SNE provides a better ordering of genes for selection. Specifically, t-SNE is used to reorder the genes so that similar genes are grouped together and dissimilar genes are kept further apart. These reordered genes are then divided into fixed-length partitions. Thereafter, one gene is selected at random from each partition and encoded, together with its weight and threshold values, in the initial population of DE, where it takes part in fitness computation. In the final generation of DE, the subset of genes with the higher classification accuracy is selected. The proposed technique is tested on six publicly available microarray datasets covering various cancerous tissues of Homo sapiens and yields a promising set of genes that provides perfect or nearly perfect classification accuracy. Moreover, the superiority of the proposed technique is demonstrated in comparison with other widely used techniques. Finally, the results are supported by a statistical test, and the identification of Gene Ontologies allows biological conclusions to be drawn.
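
The partitioning and population-encoding steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the gene ordering is a hypothetical stand-in for the t-SNE result, the function names are invented for illustration, and both the uniform [0, 1) range for weights and thresholds and the "weight exceeds threshold" decoding rule are assumptions, since the abstract does not specify them.

```python
import random


def partition(ordered_genes, size):
    """Split an ordered gene list into consecutive fixed-length partitions.

    The last partition may be shorter when the list length is not a
    multiple of `size`.
    """
    return [ordered_genes[i:i + size] for i in range(0, len(ordered_genes), size)]


def init_population(ordered_genes, partition_size, pop_size, rng):
    """Build `pop_size` individuals, each holding one random gene per partition.

    Every chosen gene is paired with a weight and a threshold drawn
    uniformly from [0, 1). These ranges are an assumption of this sketch.
    """
    parts = partition(ordered_genes, partition_size)
    population = []
    for _ in range(pop_size):
        individual = [(rng.choice(part), rng.random(), rng.random()) for part in parts]
        population.append(individual)
    return population


def selected_genes(individual):
    """Decode an individual into a gene subset.

    Assumed rule: a gene takes part in fitness computation only when its
    weight exceeds its threshold.
    """
    return [gene for gene, weight, threshold in individual if weight > threshold]


# Hypothetical ordering standing in for the t-SNE reordering of ten genes.
order = [f"g{i}" for i in range(10)]
pop = init_population(order, partition_size=4, pop_size=3, rng=random.Random(0))
```

In a full pipeline, the subset returned by `selected_genes` would be passed to an SVM, whose classification accuracy would serve as the DE fitness value for that individual.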
