Challenges rising from learning motif evaluation functions using genetic programming

Motif discovery is an important Bioinformatics problem for deciphering gene regulation. Numerous sequence-based approaches have been proposed employing human specialist motif models (evaluation functions), but performance is so unsatisfactory on benchmarks that the underlying information seems to have already been exploited and have doomed. However, we have found that even a simple modified representation still achieves considerably high performance on a challenging benchmark, implying potential for sequence-based motif discovery. Thus we raise the problem of learning motif evaluation functions. We employ Genetic programming (GP) which has the potential to evolve human competitive models. We take advantage of the terminal set containing specialist-model-like components and have tried three fitness functions. Results exhibit both great challenges and potentials. No models learnt can perform universally well on the challenging benchmark, where one reason may be the data appropriateness for sequence-based motif discovery. However, when applied on different widely-tested datasets, the same models achieve comparable performance to existing approaches based on specialist models. The study calls for further novel GP to learn different levels of effective evaluation models from strict to loose ones on exploiting sequence information for motif discovery, namely quantitative functions, cardinal rankings, and learning feasibility classifications.

[1]  S. B. Atienza-Samols,et al.  With Contributions by , 1978 .

[2]  Gang Li,et al.  Discovering multiple realistic TFBS motifs based on a generalized model , 2009, BMC Bioinformatics.

[3]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[4]  Graziano Pesole,et al.  Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes , 2004, Nucleic Acids Res..

[5]  Shane T. Jensen,et al.  Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective , 2004 .

[6]  G. Stormo Computer methods for analyzing sequence recognition of nucleic acids. , 1988, Annual Review of Biophysics and Biophysical Chemistry.

[7]  Weiguo Fan,et al.  Nonlinear ranking function representations in genetic programming-based ranking discovery for personalized search , 2006, Decis. Support Syst..

[8]  Erik van Nimwegen,et al.  PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny , 2005, PLoS Comput. Biol..

[9]  Graziano Pesole,et al.  An algorithm for finding signals of unknown length in DNA sequences , 2001, ISMB.

[10]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[11]  H. Bussemaker,et al.  Regulatory element detection using correlation with expression , 2001, Nature Genetics.

[12]  Hitoshi Iba,et al.  Identification of weak motifs in multiple biological sequences using genetic algorithm , 2006, GECCO.

[13]  Michael Q. Zhang,et al.  Mining ChIP-chip data for transcription factor and cofactor binding sites , 2005, ISMB.

[14]  Ole Winther,et al.  Discovery of Regulatory Elements is Improved by a Discriminatory Approach , 2009, PLoS Comput. Biol..

[15]  Kathleen Marchal,et al.  A Gibbs sampling method to detect over-represented motifs in the upstream regions of co-expressed genes , 2001, RECOMB.

[16]  Kwong-Sak Leung,et al.  TFBS identification based on genetic algorithm with combined representations and adaptive post-processing , 2008, Bioinform..

[17]  Riccardo Poli,et al.  A Field Guide to Genetic Programming , 2008 .

[18]  Zhi Wei,et al.  GAME: detecting cis-regulatory elements using a genetic algorithm , 2006, Bioinform..

[19]  Ernest Fraenkel,et al.  Practical Strategies for Discovering Regulatory DNA Sequence Motifs , 2006, PLoS Comput. Biol..

[20]  Finn Drabløs,et al.  Improved benchmarks for computational motif discovery , 2007, BMC Bioinformatics.

[21]  D. Galas,et al.  DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. , 1978, Nucleic acids research.

[22]  Alexander J. Hartemink,et al.  Informative priors based on transcription factor structural class improve de novo motif discovery , 2006, ISMB.

[23]  Oliver Kramer,et al.  Modeling evolutionary fitness for DNA motif discovery , 2009, GECCO.

[24]  Marie-France Sagot,et al.  Spelling Approximate Repeated or Common Motifs Using a Suffix Tree , 1998, LATIN.

[25]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[26]  M. M. Garner,et al.  A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory system , 1981, Nucleic Acids Res..

[27]  Alexander E. Kel,et al.  TRANSFAC®: transcriptional regulation, from patterns to profiles , 2003, Nucleic Acids Res..

[28]  Bin Ma,et al.  Finding Similar Regions in Many Sequences , 2002, J. Comput. Syst. Sci..