Prediction of Cancer Class with Majority Voting Genetic Programming Classifier Using Gene Expression Data

To better understand different types of cancer and to find possible disease biomarkers, many researchers have recently been analyzing gene expression data with various machine learning techniques. However, because the number of training samples is very small compared with the number of genes, and because the classes are often imbalanced, most of these methods suffer from overfitting. In this paper, we present a majority voting genetic programming classifier (MVGPC) for the classification of microarray data. Instead of evolving a single rule or a single set of rules, we evolve multiple rules with genetic programming (GP), apply them to each test sample, and assign the label chosen by majority vote. In experiments on four public cancer data sets, including multiclass data sets, MVGPC achieved better test accuracies than the other methods compared, including AdaBoost with GP. Moreover, some of the genes that occur most frequently in the classification rules are known to be associated with the types of cancer studied in this paper.
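The voting step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three rules are hand-written, hypothetical stand-ins for GP-evolved rules, the names `Rule`, `majority_vote`, and `rules` are invented here, and resolving ties in favor of the label voted first is an assumption, since the abstract does not specify tie handling.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A rule maps one sample's gene expression values to a class label.
# In MVGPC the rules would be evolved by GP; here they are hypothetical.
Rule = Callable[[Sequence[float]], int]


def majority_vote(rules: List[Rule], sample: Sequence[float]) -> int:
    """Return the label predicted by the largest number of rules.

    Ties go to the label that was voted first (an assumption of this sketch).
    """
    if not rules:
        raise ValueError("at least one rule is required")
    votes = Counter(rule(sample) for rule in rules)
    return votes.most_common(1)[0][0]


# Hypothetical rules over three genes x[0], x[1], x[2].
rules: List[Rule] = [
    lambda x: 1 if x[0] - x[2] > 0 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] * x[1] > x[2] else 0,
]

sample = [0.9, 0.2, 0.4]
predicted = majority_vote(rules, sample)  # the rules vote 1, 0, 0
```

Because each rule votes independently, a single overfitted rule can be outvoted by the others, which is the intuition behind combining multiple rules rather than relying on one.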
