An Evolutionary Gene Selection Method for Microarray Data Based on SVM Error Bound Theories

Microarray data sets contain thousands to tens of thousands of gene features, but typically only a few hundred patient samples or fewer. Identifying genes whose disruption causes congenital or acquired disease is a fundamental problem in microarray data analysis. In this paper, we propose an efficient evolutionary SVM-based classifier that selects a small number of features while maintaining high accuracy. The proposed method trains an SVM on a given subset of features to evaluate the fitness function. New subsets of features are then selected using two sources of information: several leave-one-out error bounds for the SVM classifier, and the frequency with which each feature occurs during the evolutionary search. We test the proposed method on several microarray data sets and find that it obtains high classification accuracy with a smaller number of selected genes.
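The general shape of such a search can be illustrated with a minimal sketch. It is not the authors' algorithm: the function `evolve_subsets`, its parameters, and the `toy_error` fitness are hypothetical names introduced here. The fitness is a caller-supplied stand-in for training an SVM and evaluating a leave-one-out error bound, neither of which is implemented. The sketch shows only how the frequency of occurrence of features among good subsets could bias the sampling of new subsets.

```python
import random
from collections import Counter


def evolve_subsets(n_features, fitness, subset_size=5, pop_size=20,
                   generations=30, keep=5, seed=0):
    """Search for a small feature subset with low `fitness` (lower is better).

    `fitness(subset)` is an assumption of this sketch: in the setting of the
    abstract it would train an SVM on the given features and return an
    error estimate such as a leave-one-out bound.
    """
    rng = random.Random(seed)
    features = list(range(n_features))
    # Start from random subsets of a fixed size.
    population = [tuple(sorted(rng.sample(features, subset_size)))
                  for _ in range(pop_size)]
    counts = Counter()  # how often each feature appears among survivors

    for _ in range(generations):
        # Keep the `keep` best subsets unchanged (elitism).
        survivors = sorted(population, key=fitness)[:keep]
        for subset in survivors:
            counts.update(subset)
        # Features seen often among survivors are sampled more often;
        # the +1 keeps features that were never seen reachable.
        weights = [counts[f] + 1 for f in features]
        children = []
        while len(children) < pop_size - keep:
            child = set()
            while len(child) < subset_size:
                child.add(rng.choices(features, weights=weights, k=1)[0])
            children.append(tuple(sorted(child)))
        population = survivors + children

    return min(population, key=fitness), counts


def toy_error(subset, informative=frozenset({0, 1, 2})):
    """Hypothetical fitness: number of informative genes the subset misses."""
    return len(informative - set(subset))


best_subset, feature_counts = evolve_subsets(n_features=50, fitness=toy_error)
```

In a real experiment, `toy_error` would be replaced by a function that fits an SVM on the expression columns indexed by `subset` and returns the chosen error estimate. The fixed subset size and the simple `+1` smoothing are simplifications made for this sketch.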
