Gene expression info-gibbs : a motif discovery algorithm that directly optimizes information content during sampling

MOTIVATION Discovering cis-regulatory elements in genome sequence remains a challenging issue. Several methods rely on the optimization of some target scoring function. The information content (IC) or relative entropy of the motif has proven to be a good estimator of transcription factor DNA binding affinity. However, these information-based metrics are usually used as a posteriori statistics rather than during the motif search process itself. RESULTS We introduce here info-gibbs, a Gibbs sampling algorithm that efficiently optimizes the IC or the log-likelihood ratio (LLR) of the motif while keeping computation time low. The method compares well with existing methods like MEME, BioProspector, Gibbs or GAME on both synthetic and biological datasets. Our study shows that motif discovery techniques can be enhanced by directly focusing the search on the motif IC or the motif LLR. AVAILABILITY http://rsat.ulb.ac.be/rsat/info-gibbs

[1]  M. Tompa,et al.  Discovery of novel transcription factor binding sites by statistical overrepresentation. , 2002, Nucleic acids research.

[2]  J. Collado-Vides,et al.  Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. , 2000, Nucleic acids research.

[3]  Gary D. Stormo,et al.  Identification of consensus patterns in unaligned DNA sequences known to be functionally related , 1990, Comput. Appl. Biosci..

[4]  Jacques van Helden,et al.  RSAT: regulatory sequence analysis tools , 2008, Nucleic Acids Res..

[5]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[6]  Shane T. Jensen,et al.  BioOptimizer: a Bayesian scoring function approach to motif discovery , 2004, Bioinform..

[7]  Kwong-Sak Leung,et al.  TFBS identification based on genetic algorithm with combined representations and adaptive post-processing , 2008, Bioinform..

[8]  Saurabh Sinha,et al.  YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation , 2003, Nucleic Acids Res..

[9]  G. Stormo Information content and free energy in DNA--protein interactions. , 1998, Journal of theoretical biology.

[10]  Graziano Pesole,et al.  An algorithm for finding signals of unknown length in DNA sequences , 2001, ISMB.

[11]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[12]  G. Church,et al.  Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. , 2000, Journal of molecular biology.

[13]  Zhi Wei,et al.  GAME: detecting cis-regulatory elements using a genetic algorithm , 2006, Bioinform..

[14]  T. D. Schneider,et al.  Sequence logos: a new way to display consensus sequences. , 1990, Nucleic acids research.

[15]  Kazuhito Shida,et al.  GibbsST: a Gibbs sampling method for motif discovery with enhanced resistance to local optima , 2006, BMC Bioinformatics.

[16]  Shane T. Jensen,et al.  Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective , 2004 .

[17]  Nicola J. Rinaldi,et al.  Transcriptional regulatory code of a eukaryotic genome , 2004, Nature.

[18]  Pavel A. Pevzner,et al.  Combinatorial Approaches to Finding Subtle Signals in DNA Sequences , 2000, ISMB.

[19]  Jacques van Helden,et al.  Regulatory Sequence Analysis Tools , 2003, Nucleic Acids Res..

[20]  Jun S. Liu,et al.  Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies , 1995 .

[21]  Kathleen Marchal,et al.  A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling , 2001, Bioinform..

[22]  Michael Q. Zhang,et al.  SCPD: a promoter database of the yeast Saccharomyces cerevisiae , 1999, Bioinform..

[23]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[24]  T. D. Schneider,et al.  Information content of binding sites on nucleotide sequences. , 1986, Journal of molecular biology.

[25]  Jun S. Liu,et al.  Gibbs motif sampling: Detection of bacterial outer membrane protein repeats , 1995, Protein science : a publication of the Protein Society.

[26]  Julio Collado-Vides,et al.  RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation , 2007, Nucleic Acids Res..

[27]  Kazuhito Shida,et al.  Hybrid Gibbs-sampling algorithm for challenging motif discovery: GibbsDST. , 2006, Genome informatics. International Conference on Genome Informatics.

[28]  G. Crooks,et al.  WebLogo: a sequence logo generator. , 2004, Genome research.

[29]  J. Collado-Vides,et al.  Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. , 1998, Journal of molecular biology.

[30]  Kathleen Marchal,et al.  A Gibbs sampling method to detect over-represented motifs in the upstream regions of co-expressed genes , 2001, RECOMB.

[31]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[32]  Martin Vingron,et al.  Natural similarity measures between position frequency matrices with an application to clustering , 2008, Bioinform..

[33]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[34]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.