Markov blanket-embedded genetic algorithm for gene selection

Microarray technologies enable the simultaneous quantitative monitoring of expression levels for thousands of genes under various experimental conditions, which has opened a new way of performing biological classification on a genome-wide scale. However, predictive accuracy suffers from the presence of thousands of genes, many of which are unnecessary for classification. A key issue in microarray data classification is therefore to identify the smallest possible set of genes that still achieves good predictive accuracy. In this study, we propose a novel Markov blanket-embedded genetic algorithm (MBEGA) for the gene selection problem. In particular, the embedded Markov blanket-based memetic operators add features (genes) to, or delete them from, a genetic algorithm (GA) solution so as to quickly improve the solution and fine-tune the search. Empirical results on synthetic and microarray benchmark datasets suggest that MBEGA is effective and efficient in eliminating irrelevant and redundant features based on both the Markov blanket and predictive power in the classifier model. A detailed comparative study with representative filter, wrapper, and standard GA methods shows that MBEGA gives the best compromise among all four evaluation criteria: classification accuracy, number of selected genes, computational cost, and robustness.
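
To make the add and delete operators more concrete, the following is a minimal sketch of how such memetic operators could act on a binary GA chromosome. The abstract does not state the relevance or redundancy measure, so this sketch assumes symmetrical uncertainty on discretized expression values and the approximate Markov blanket condition used in redundancy-based feature selection. The function names (`symmetrical_uncertainty`, `add_relevant`, `delete_redundant`) are hypothetical, and the classifier-based fitness evaluation of the GA is omitted.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(list(zip(x, y)))  # joint entropy H(x, y)
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def add_relevant(mask, columns, labels):
    """Add operator (assumed form): switch on the excluded feature
    that is most relevant to the class labels."""
    excluded = [i for i, bit in enumerate(mask) if not bit]
    if not excluded:
        return list(mask)
    best = max(excluded,
               key=lambda i: symmetrical_uncertainty(columns[i], labels))
    new_mask = list(mask)
    new_mask[best] = 1
    return new_mask


def delete_redundant(mask, columns, labels):
    """Delete operator (assumed form): switch off every selected feature j
    for which an already kept feature i is an approximate Markov blanket,
    i.e. SU(i, class) >= SU(j, class) and SU(i, j) >= SU(j, class)."""
    selected = [i for i, bit in enumerate(mask) if bit]
    relevance = {i: symmetrical_uncertainty(columns[i], labels)
                 for i in selected}
    # Visiting features by decreasing relevance guarantees the first condition.
    selected.sort(key=lambda i: relevance[i], reverse=True)
    kept = []
    for j in selected:
        covered = any(
            symmetrical_uncertainty(columns[i], columns[j]) >= relevance[j]
            for i in kept
        )
        if not covered:
            kept.append(j)
    return [1 if i in kept else 0 for i in range(len(mask))]
```

On a toy dataset with labels `[0, 0, 0, 0, 1, 1, 1, 1]`, two identical features that copy the labels, and one alternating feature unrelated to the labels, `delete_redundant([1, 1, 1], columns, labels)` returns `[1, 0, 0]`: the duplicate and the irrelevant feature are both removed. In a memetic GA these operators would typically be applied to selected chromosomes after the usual genetic operators, with a modified chromosome retained only if its fitness does not degrade; that acceptance rule is likewise an assumption here.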
