Classification of Gene Expression Data with Genetic Programming

This paper summarizes the use of a genetic programming (GP) system to develop classification rules for gene expression data, rules that hold promise for the development of new molecular diagnostics. The work focuses on discovering simple, accurate rules that diagnose diseases based on changes in gene expression profiles within a diseased cell. GP is shown to be a useful technique for discovering classification rules in a supervised learning mode, where the biological genotype is paired with a biological phenotype such as a disease state. Because the number of independent variables is large and the number of samples is comparatively small, developing these rules requires new techniques for establishing fitness and for interpreting the results of evolutionary runs. These techniques are described, and two issues are discussed: overfitting caused by small sample sizes, and the behavior of the GP system when variables are missing from the samples.
