Machine Learning Techniques and Chi-Square Feature Selection for Cancer Classification Using SAGE Gene Expression Profiles

Recently developed Serial Analysis of Gene Expression (SAGE) technology enables us to simultaneously quantify the expression levels of tens of thousands of genes in a population of cells. SAGE is better than Microarray in that SAGE can monitor both known and unknown genes while Microarray can only measure known genes. SAGE gene expression profiling based cancer classification is a better choice since cancers may be due to some unknown genes. Whereas a wide range of methods has been applied to traditional Microarray based cancer classification, relatively few studies have been done on SAGE based cancer classification. In our study we evaluate popular machine learning methods (SVM, Naive Bayes, Nearest Neighbor, C4.5 and RIPPER) for classifying cancers based on SAGE data. In order to deal with the high dimensional problem, we propose to use Chi-square for tag/gene selection. Both binary classification and multicategory classification are investigated. The experiments are based on two human SAGE datasets: brain and breast. The results show that SVM and Naive Bayes are the top-performing SAGE classifiers and that Chi-square based gene selection can improve the performance of all the five classifiers investigated.

[1]  William W. Cohen Fast Effective Rule Induction , 1995, ICML.

[2]  Jun S. Liu,et al.  Clustering analysis of SAGE data using a Poisson approach , 2004, Genome Biology.

[3]  Aiko M. Hormann,et al.  Programs for Machine Learning. Part I , 1962, Inf. Control..

[4]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[5]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[6]  Pat Langley,et al.  Estimating Continuous Distributions in Bayesian Classifiers , 1995, UAI.

[7]  David W. Aha,et al.  Instance-Based Learning Algorithms , 1991, Machine Learning.

[8]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[9]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[10]  Ji Huang,et al.  [Serial analysis of gene expression]. , 2002, Yi chuan = Hereditas.

[11]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[12]  Georgios Paliouras,et al.  Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach , 2000, ArXiv.

[13]  Karl-Michael Schneider,et al.  A Comparison of Event Models for Naive Bayes Anti-Spam E-Mail Filtering , 2003, EACL.

[14]  Jennifer Chan,et al.  A neural survival factor is a candidate oncogene in breast cancer , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[15]  Constantin F. Aliferis,et al.  A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis , 2004, Bioinform..

[16]  Tianshun Yao,et al.  An evaluation of statistical spam filtering techniques , 2004, TALIP.

[17]  K. Kinzler,et al.  Serial Analysis of Gene Expression , 1995, Science.

[18]  Jörg Sander,et al.  Hierarchical cluster analysis of SAGE data for cancer profiling , 2001, BIOKDD.

[19]  Tian-Li Wang,et al.  Identifying tumor origin using a gene expression-based classification map. , 2003, Cancer research.

[20]  Steven J. M. Jones,et al.  A methodology for analyzing SAGE libraries for cancer profiling , 2005, TOIS.

[21]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[22]  S. Sathiya Keerthi,et al.  Improvements to Platt's SMO Algorithm for SVM Classifier Design , 2001, Neural Computation.

[23]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.