Modeling evolutionary fitness for DNA motif discovery

The motif discovery problem consists of finding over-represented patterns in a collection of sequences. Its difficulty stems partly from the many possible ways to define both the motif space to be searched and the notion of over-representation. Because the size of the search space is generally exponential in the motif length, many heuristic methods, including evolutionary algorithms, have been developed. However, comparatively little attention has been devoted to evaluating motif quality adequately, especially when comparing motifs of different lengths. We propose an evolution strategy for the motif discovery problem, based on a new fitness function that simultaneously takes into account (1) the number of motif occurrences, (2) the motif length, and (3) the motif's information content. Experimental results show that the proposed method succeeds in uncovering the correct motif positions and length with high accuracy.
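To make the third ingredient concrete, the following is a minimal sketch of how the information content of a set of aligned motif occurrences is commonly computed: the sum over alignment columns of the relative entropy against a background distribution. It is not the fitness function proposed here. The function name `information_content`, the uniform background of 0.25 per nucleotide, and the optional `pseudocount` parameter are assumptions made for illustration, and no small-sample correction is applied.

```python
import math
from collections import Counter

ALPHABET = "ACGT"
BACKGROUND = 0.25  # assumption: uniform background frequency per nucleotide


def information_content(sites, pseudocount=0.0):
    """Return the total information content (in bits) of aligned DNA sites.

    `sites` is a list of equal-length strings over A, C, G, T, one per
    motif occurrence. Each column contributes sum_b p_b * log2(p_b / 0.25).
    """
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")

    denom = len(sites) + pseudocount * len(ALPHABET)
    total = 0.0
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        for base in ALPHABET:
            p = (counts[base] + pseudocount) / denom
            if p > 0:  # treat 0 * log(0) as 0
                total += p * math.log2(p / BACKGROUND)
    return total


if __name__ == "__main__":
    print(information_content(["ACGT", "ACGT"]))  # fully conserved columns
    print(information_content(["AA", "AC"]))      # one conserved, one split column
```

Each column contributes a non-negative number of bits, so under this plain definition extending a motif can never lower its score. This is one reason why information content alone is hard to use when comparing motifs of different lengths, and why it is combined here with the number of occurrences and the motif length.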
