Improving Feature Subset Selection Using a Genetic Algorithm for Microarray Gene Expression Data

Microarray data usually contains a huge number of genes (features) and a comparatively small number of samples, which make accurate classification or prediction of diseases challenging. Feature selection techniques can help us identify important and irrelevant (unimportant) features by applying certain selection criteria. However, different feature selection algorithms based on various theoretical arguments often produce different results when applied to the same data set. This makes selecting an optimal or near optimal feature subset for a data set difficult. In this paper, we propose using a genetic algorithm to improve feature subset selection by combining valuable outcomes from multiple feature selection methods. The goal of our genetic algorithm is to achieve a balance between the classification accuracy and the size of the feature subsets selected. The advantages of this approach include the ability to accommodate different feature selection criteria and find small subsets of features that perform well for a particular inductive learning algorithm of interest to build the classifier. The experimental results demonstrate that our approach can find subsets of features with higher classification accuracy and/or smaller size compared with each individual feature selection algorithm.

[1]  Bernhard Schölkopf,et al.  Support Vector Machine Applications in Computational Biology , 2004 .

[2]  Huan Liu,et al.  Chi2: feature selection and discretization of numeric attributes , 1995, Proceedings of 7th IEEE International Conference on Tools with Artificial Intelligence.

[3]  Huan Liu,et al.  Handling Large Unsupervised Data via Dimensionality Reduction , 1999, 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

[4]  Jihoon Yang,et al.  Feature Subset Selection Using a Genetic Algorithm , 1998, IEEE Intell. Syst..

[5]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[6]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[7]  Ying Liu,et al.  A Comparative Study on Feature Selection Methods for Drug Discovery , 2004, J. Chem. Inf. Model..

[8]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[9]  Christopher J. C. Burges,et al.  A Tutorial on Support Vector Machines for Pattern Recognition , 1998, Data Mining and Knowledge Discovery.

[10]  Isabelle Guyon,et al.  Statistical Learning and Kernel Methods in Bioinformatics , 2003 .

[11]  Huiqing Liu,et al.  A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. , 2002, Genome informatics. International Conference on Genome Informatics.

[12]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[13]  E. Lander,et al.  Gene expression correlates of clinical prostate cancer behavior. , 2002, Cancer cell.

[14]  Sayan Mukherjee,et al.  Feature Selection for SVMs , 2000, NIPS.