Classification of oncologic data with genetic programming

Discovering models that explain the hidden relationship between genetic material and tumor pathologies is one of the most important open challenges in biology and medicine. Given the large amount of data made available by the DNA microarray technique, Machine Learning is becoming a popular tool for this kind of investigation. In the last few years, we have been particularly involved in the study of Genetic Programming for mining large sets of biomedical data. In this paper, we compare four variants of Genetic Programming on the classification of two oncologic datasets: the first contains data from healthy colon tissues and colon tissues affected by cancer; the second contains data from patients affected by two kinds of leukemia (acute myeloid leukemia and acute lymphoblastic leukemia). We report experimental results obtained using two different fitness criteria: the receiver operating characteristic and the percentage of correctly classified instances. These results, and their comparison with those obtained by three non-evolutionary Machine Learning methods (Support Vector Machines, MultiBoosting, and Random Forests) on the same data, suggest that Genetic Programming is a promising technique for this kind of classification.
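
To make the two fitness criteria concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that an evolved program outputs one real-valued score per sample, that the ROC-based criterion is summarized as the area under the ROC curve, and that the accuracy-based criterion thresholds the score at a fixed value; the function names and the `threshold` parameter are hypothetical.

```python
from typing import Sequence


def accuracy_fitness(scores: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.0) -> float:
    """Fraction of correctly classified instances.

    Assumption: a score above `threshold` is read as class 1, otherwise class 0.
    """
    correct = sum((s > threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


def auc_fitness(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores produced by one candidate program on four samples.
scores = [0.2, 0.8, 0.6, 0.4]
labels = [1, 0, 1, 0]
print(accuracy_fitness(scores, labels, threshold=0.5))
print(auc_fitness(scores, labels))
```

The pairwise form of the area under the curve is quadratic in the number of samples, which is acceptable for microarray datasets with a few dozen samples; unlike the accuracy criterion, it does not depend on a choice of threshold.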
