Evolving Regular Expressions for GeneChip Probe Performance Prediction

Commercial GeneChips provide highly redundant but noisy data. Rapidly identifying and then rejecting bad data increases the quality of the remaining data at little cost, whilst also serving as a basis for better understanding the biophysics of short surface-mounted DNA sequences. Affymetrix High Density Oligonucleotide Arrays (HDONA) simultaneously measure the expression of thousands of genes using millions of probes. Regular expressions (REs) can be evolved from a Backus-Naur form (BNF) context-free grammar using tree-based, strongly typed genetic programming (GP) written in gawk. Fitness is given by egrep. The quality of an individual HG-U133A probe is indicated by its correlation, across 6685 human tissue samples from NCBI’s GEO database, with other measurements for the same gene. Low concordance indicates a poor probe. The evolved, data-mined motif is better at predicting poor DNA sequences than an existing human-generated RE, suggesting that runs of cytosine and guanine, and mixtures of the two, should all be avoided. Section 4.6 gives more details of the RE GP gawk implementation.
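The idea of deriving candidate regular expressions from a BNF grammar and scoring them by pattern matching can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy grammar, the helper names `derive` and `fitness`, and the made-up probe strings and labels are hypothetical and are not taken from the paper. Python's `re.search` stands in for egrep, and plain random sampling stands in for the genetic programming search (there is no crossover, mutation, or type system here).

```python
import random
import re

# Hypothetical toy BNF (NOT the paper's grammar): nonterminal -> list of productions.
GRAMMAR = {
    "<re>": [["<elem>"], ["<elem>", "<re>"]],
    "<elem>": [["<unit>"], ["<unit>", "{", "<digit>", ",}"]],
    "<unit>": [["<base>"], ["[", "<base>", "<base>", "]"]],
    "<base>": [["A"], ["C"], ["G"], ["T"]],
    "<digit>": [["2"], ["3"], ["4"]],
}

# Made-up toy data: (sequence, is_poor). Real probes and labels would come
# from the correlation analysis described in the abstract.
TOY_PROBES = [
    ("ACGGGGTACA", True),
    ("TTCCCCGGAT", True),
    ("ACGTACGTAC", False),
    ("TATATGCATA", False),
]


def derive(symbol, rng, depth=0, max_depth=6):
    """Expand a grammar symbol into a regular expression string."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    productions = GRAMMAR[symbol]
    if depth >= max_depth:
        # At the depth limit keep only the shortest productions, which in
        # this toy grammar are the non-recursive ones, so expansion stops.
        shortest = min(len(p) for p in productions)
        productions = [p for p in productions if len(p) == shortest]
    production = rng.choice(productions)
    return "".join(derive(s, rng, depth + 1, max_depth) for s in production)


def fitness(pattern, probes):
    """Count probes classified correctly: a match predicts a poor probe."""
    compiled = re.compile(pattern)
    return sum((compiled.search(seq) is not None) == is_poor
               for seq, is_poor in probes)


if __name__ == "__main__":
    rng = random.Random(0)
    population = [derive("<re>", rng) for _ in range(20)]
    best = max(population, key=lambda p: fitness(p, TOY_PROBES))
    print(best, fitness(best, TOY_PROBES))
```

Restricting candidates to strings derivable from the grammar guarantees that every individual is a syntactically valid pattern, so no fitness evaluations are wasted on expressions the matcher would reject.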
