Feature Selection for Cancer Classification Based on Support Vector Machine

Feature selection plays an important role in cancer classification because gene expression data usually have a large number of dimensions (genes) and a relatively small number of samples. In this paper, we use the support vector machine (SVM) for cancer classification and propose a mixed two-step feature selection method. The first step uses a modified t-test method to select discriminatory features. The second step extracts principal components from the genes ranked highest by the modified t-test. We tested our two-step method on three data sets: the lymphoma data set, the SRBCT data set, and the ovarian cancer data set. On all three data sets, our two-step method achieves 100% accuracy with far fewer genes than other published results.
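To make the two-step idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not define the modified t-test, so the sketch substitutes the ordinary two-class Welch t-statistic as a stand-in ranking score. The function names (`welch_t_scores`, `top_genes`, `principal_components`), the toy expression matrix, and the choice of power iteration for the principal components are all assumptions made for illustration. The final SVM step is omitted so that the example needs only the Python standard library.

```python
import math
import random
from statistics import mean, variance

def welch_t_scores(samples, labels):
    """Absolute Welch t-statistic per gene for a two-class problem.
    Stand-in for the paper's modified t-test, which the abstract does not define."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    group_a = [s for s, l in zip(samples, labels) if l == classes[0]]
    group_b = [s for s, l in zip(samples, labels) if l == classes[1]]
    scores = []
    for g in range(len(samples[0])):
        xa = [s[g] for s in group_a]
        xb = [s[g] for s in group_b]
        se = math.sqrt(variance(xa) / len(xa) + variance(xb) / len(xb))
        scores.append(abs(mean(xa) - mean(xb)) / se if se > 0 else 0.0)
    return scores

def top_genes(scores, n):
    """Step 1: indices of the n highest-scoring genes."""
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:n]

def principal_components(rows, k, iters=200, seed=0):
    """Step 2: first k principal axes via power iteration with deflation.
    Returns (axes, projected), where projected holds each row's k component scores."""
    n, d = len(rows), len(rows[0])
    means = [mean(r[j] for r in rows) for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    axes = []
    for _ in range(k):
        v = [rng.random() + 1e-9 for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break  # no variance left to explain
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue; subtract it out (deflation).
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        axes.append(v)
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    projected = [[sum(c[j] * a[j] for j in range(d)) for a in axes] for c in centered]
    return axes, projected

# Toy data (hypothetical): 6 samples x 4 genes; genes 0 and 1 differ between classes.
samples = [
    [5.0, 1.0, 2.0, 3.1],
    [5.2, 1.1, 2.9, 3.0],
    [4.9, 0.9, 2.4, 2.8],
    [1.0, 3.0, 2.5, 3.2],
    [1.1, 3.2, 2.1, 2.9],
    [0.9, 2.9, 2.8, 3.0],
]
labels = [0, 0, 0, 1, 1, 1]

scores = welch_t_scores(samples, labels)
selected = top_genes(scores, 2)                      # step 1: rank and keep top genes
reduced = [[s[g] for g in selected] for s in samples]
axes, projected = principal_components(reduced, 1)   # step 2: PCA on the kept genes
```

In a full pipeline, the rows of `projected` would serve as the low-dimensional inputs to an SVM classifier, with the ranking and the principal axes fitted on training samples only.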
