On the use of genetic programming for the prediction of survival in cancer

The classification of cancer patients into risk classes is a very active field of research with direct clinical applications. We recently compared several machine learning methods on the well-known 70-gene signature dataset. In that study, genetic programming showed promising results, outperforming all the other techniques. Nevertheless, the study was preliminary, mainly because the validation dataset was preprocessed and all its features were binarized so that logical operators could be used as the genetic programming functional nodes. Although this choice allowed a simple interpretation of the solutions from the biological viewpoint, the binarization was limiting, since it amounts to a sizable loss of information. The goal of this paper is to overcome this limitation by using the 70-gene signature dataset with real-valued expression data. The results we present show that genetic programming using the number of incorrectly classified instances as its fitness function is not able to outperform the other machine learning methods. However, when a weighted average of false positives and false negatives is used to calculate fitness values, genetic programming performs comparably to the other methods in minimizing incorrectly classified instances and outperforms all of them in minimizing false negatives, which is one of the main goals in breast cancer clinical applications. In this case too, the solutions returned by genetic programming are simple, easy to understand, and use a rather limited subset of the available features.
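
To make the two fitness definitions concrete, the following Python sketch contrasts a plain misclassification count with a weighted average of false positives and false negatives. It is only an illustration under stated assumptions: the function names, the `fn_weight` parameter and its default value, and the label encoding (1 for the positive, poor-prognosis class, 0 for the negative class) are hypothetical and are not taken from the paper, which does not state the weights used here.

```python
from typing import Sequence


def misclassification_count(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """Fitness as the number of incorrectly classified instances (lower is better)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p)


def weighted_error_fitness(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    fn_weight: float = 0.75,  # hypothetical weight, not a value from the paper
) -> float:
    """Fitness as a weighted average of false negatives and false positives.

    Assumes labels are encoded as 1 (positive class) and 0 (negative class).
    A larger fn_weight penalizes false negatives more heavily.
    """
    if not 0.0 <= fn_weight <= 1.0:
        raise ValueError("fn_weight must lie in [0, 1]")
    # False positive: true label 0, predicted 1.
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # False negative: true label 1, predicted 0.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fn_weight * fn + (1.0 - fn_weight) * fp


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1]
    y_pred = [0, 1, 1, 0, 0]
    print(misclassification_count(y_true, y_pred))
    print(weighted_error_fitness(y_true, y_pred, fn_weight=0.75))
```

With `fn_weight` above 0.5, a candidate solution that misses a positive case receives a worse fitness than one that raises a false alarm, which steers the search toward fewer false negatives even when the total number of errors is unchanged.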
