Feature Selection for Cancer Classification on Microarray Expression Data

Microarrays are an important tool in gene analysis research: they can help identify genes that might cause various cancers. In this paper, we use feature selection methods and the support vector machine (SVM) to search for disease-causing genes in microarray data for three different cancers. The feature selection methods are based on Euclidean distance (ED) and the Pearson correlation coefficient (PCC). We investigate how prediction results change when the SVM is trained with different numbers of features and different kinds of kernels. The results show that the linear kernel is the best-suited kernel for this problem. In addition, equal or higher accuracy can be achieved with only 15 to 100 features selected from the 7129 or more features of the original data sets.
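The abstract does not specify how the ED and PCC scores are computed, so the following is only a minimal sketch of one common filter-style approach, not the authors' procedure. It assumes binary class labels (0/1), expression values already scaled to [0, 1], and that each gene is scored against the label vector treated as an "ideal" expression pattern. The names `pearson`, `euclidean`, and `select_features` are hypothetical helpers introduced for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant gene carries no linear signal
    return cov / (sx * sy)


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def select_features(samples, labels, k, method="pcc"):
    """Return the indices of the k top-ranked genes.

    samples: list of rows, one per sample, one column per gene.
    labels:  list of 0/1 class labels, one per sample (assumption).
    method:  "pcc" ranks by |correlation with the labels|;
             "ed" ranks by smallest distance to the label vector.
    """
    n_genes = len(samples[0])
    scored = []
    for g in range(n_genes):
        column = [row[g] for row in samples]
        if method == "pcc":
            score = abs(pearson(column, labels))  # higher is better
        else:
            score = -euclidean(column, labels)    # smaller distance is better
        scored.append((score, g))
    # Sort by descending score; break ties by gene index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [g for _, g in scored[:k]]


# Toy data: 4 samples x 3 genes (made-up values, for illustration only).
samples = [
    [0.9, 0.5, 0.1],
    [0.8, 0.4, 0.2],
    [0.1, 0.5, 0.9],
    [0.2, 0.6, 0.8],
]
labels = [1, 1, 0, 0]

top_by_ed = select_features(samples, labels, k=1, method="ed")
top_by_pcc = select_features(samples, labels, k=2, method="pcc")
```

In a full pipeline, the columns named by the returned indices would then be used to train the SVM; that step is omitted here to keep the sketch free of third-party dependencies.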
