A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length

MOTIVATION: Transcription regulatory protein factors often bind DNA as homo-dimers or hetero-dimers. They therefore recognize structured DNA motifs: inverted repeats, direct repeats, or spaced motif pairs. These motifs are often difficult to identify because they are highly divergent. Including the motif structure explicitly in the motif recognition algorithm improves recognition efficiency for highly divergent motifs and improves estimation of the motif's geometric parameters.

RESULTS: We present a modification of the Gibbs sampling motif extraction algorithm, SeSiMCMC (Sequence Similarities by Markov Chain Monte Carlo), which finds structured motifs of these types, as well as non-structured motifs, in a set of unaligned DNA sequences. It employs improved estimators of motif and spacer lengths. The probability that a sequence does not contain any motif is accounted for in a rigorous Bayesian manner. We applied the algorithm to a set of upstream regions of genes from two Escherichia coli regulons involved in respiration and demonstrated that accounting for a symmetric motif structure allows the algorithm to identify weak motifs more accurately. In the examples studied, ArcA binding sites had the structure of a direct spaced repeat, whereas NarP binding sites had a palindromic structure.

AVAILABILITY: The WWW interface of the program and its FreeBSD (4.0) and Windows 32 console executables are available at http://bioinform.genetika.ru/SeSiMCMC
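
One way to picture how a palindromic (inverted repeat) structure can be built into motif estimation is to pool each motif position with its mirror position on the complementary strand when counting nucleotides. The sketch below is only an illustration under that assumption; the function names `count_matrix` and `symmetrize_palindrome` are hypothetical and are not taken from SeSiMCMC, whose actual estimators are not described in this abstract.

```python
from collections import Counter

# Watson-Crick complements for the four DNA letters.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def count_matrix(sites):
    """Per-position nucleotide counts for equal-length aligned sites."""
    length = len(sites[0])
    return [Counter(site[i] for site in sites) for i in range(length)]


def symmetrize_palindrome(counts):
    """Pool position i with its mirror position on the opposite strand.

    Assumption for illustration: a palindromic motif is treated as equal
    to its own reverse complement, so the count of base b at position i
    is added to the count of complement(b) at position L-1-i.
    """
    length = len(counts)
    symmetric = []
    for i in range(length):
        mirror = counts[length - 1 - i]
        symmetric.append(
            {b: counts[i][b] + mirror[COMPLEMENT[b]] for b in "ACGT"}
        )
    return symmetric


# Two toy sites (made-up data, not from the paper).
sites = ["ACGT", "AAGT"]
sym = symmetrize_palindrome(count_matrix(sites))
```

Pooling doubles the effective number of observations per column, which is one intuition for why a symmetry constraint could help with weak, divergent motifs. A direct repeat could be handled analogously by pooling position i with position i in the second half-site, without complementing.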
