M are better than one: an ensemble-based motif finder and its application to regulatory element prediction

MOTIVATION: Identifying regulatory elements in genomic sequences is a key component in understanding the control of gene expression. Computationally, this problem is often addressed by motif discovery, where the goal is to find a set of mutually similar subsequences within a collection of input sequences. Though motif discovery is widely studied and many approaches to it have been suggested, it remains a challenging and as yet unresolved problem.

RESULTS: We introduce SAMF (Solution-Aggregating Motif Finder), a novel approach for motif discovery. SAMF is based on a Markov Random Field formulation, and its key idea is to uncover and aggregate multiple statistically significant solutions to the given motif finding problem. In contrast to many earlier methods, SAMF does not require prior estimates on the number of motif instances present in the data, is not limited by motif length, and allows motifs to overlap. Though SAMF is broadly applicable, these features make it particularly well suited for addressing the challenges of prokaryotic regulatory element detection. We test SAMF's ability to find transcription factor binding sites in an Escherichia coli dataset and show that it outperforms previous methods. Additionally, we uncover a number of previously unidentified binding sites in this data, and provide evidence that they correspond to actual regulatory elements.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
