A Genetic Algorithm Approach for Discovering Diagnostic Patterns in Molecular Measurement Data

The objective of this work is to develop an algorithm that, after training, can discriminate between disease classes in molecular data. The proposed system uses a genetic algorithm (GA) to achieve this discrimination. We apply our method to three publicly available data sets. Two of the data sets consist of microarray data, which allow the expression levels of genes to be measured simultaneously under different disease states. The third data set comes from serum proteomic pattern diagnostics of ovarian cancer, in which high-resolution mass spectrometry is used to extract a set of biomarker classifiers. We show that our methodology finds an abundance of different feature models, each built on an automatically selected subset of discriminatory features, with classification accuracy comparable to that of the other approaches considered. This abundance raises the question of how to choose among the many competing models while simultaneously estimating the prediction accuracy of the chosen models.
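The general idea of GA-driven feature subset selection can be sketched as follows. This is a minimal illustration, not the method used in this work: the abstract does not specify the classifier, the fitness function, or the genetic operators, so the nearest-centroid classifier, leave-one-out fitness, size penalty, tournament selection, uniform crossover, bit-flip mutation, and the synthetic data below are all assumptions chosen for brevity.

```python
import random

def make_data(n_per_class=20, n_features=10, informative=(0, 1), shift=3.0, seed=0):
    """Synthetic two-class data (assumed); only `informative` features differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in informative:
                row[j] += shift * label
            X.append(row)
            y.append(label)
    return X, y

def loo_accuracy(mask, X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier on the masked features."""
    feats = [j for j, bit in enumerate(mask) if bit]
    if not feats:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i in range(len(X)):
        dists = {}
        for label in set(y):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroid = [sum(r[j] for r in rows) / len(rows) for j in feats]
            dists[label] = sum((X[i][j] - c) ** 2 for j, c in zip(feats, centroid))
        correct += min(dists, key=dists.get) == y[i]
    return correct / len(X)

def fitness(mask, X, y, size_penalty=0.01):
    """Accuracy minus a small (assumed) penalty per selected feature."""
    return loo_accuracy(mask, X, y) - size_penalty * sum(mask)

def evolve(X, y, pop_size=20, generations=15, p_mut=0.1, seed=1):
    """Evolve bit masks over features; returns the best mask in the final population."""
    rng = random.Random(seed)
    n = len(X[0])
    cache = {}

    def score(mask):
        key = tuple(mask)
        if key not in cache:  # avoid re-evaluating masks already seen
            cache[key] = fitness(mask, X, y)
        return cache[key]

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=score, reverse=True)
        nxt = ranked[:2]  # elitism: carry the two best masks over unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(ranked, 3), key=score)  # tournament selection
            b = max(rng.sample(ranked, 3), key=score)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=score)

X, y = make_data()
best = evolve(X, y)
print("selected features:", [j for j, bit in enumerate(best) if bit])
print("leave-one-out accuracy:", loo_accuracy(best, X, y))
```

Running the search with different seeds would typically return different masks with similar accuracy, which mirrors the model-choice question raised above. Note also that the leave-one-out accuracy used as fitness here is optimistically biased as an estimate of prediction accuracy, because the same samples guide the feature selection; an honest estimate would need data held out from the entire search.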
