A New Data Mining Approach for the Detection of Bacterial Promoters Combining Stochastic and Combinatorial Methods

We present a new data mining method based on stochastic analysis (Hidden Markov Model [HMM]) and combinatorial methods for discovering new transcriptional factors in bacterial genome sequences. Sigma factor binding sites (SFBSs) were described as patterns of box1-spacer-box2 corresponding to the -35 and -10 DNA motifs of bacterial promoters. We used a high-order HMM in which the hidden process is a second-order Markov chain. When the model was applied to the genome of the model bacterium Streptomyces coelicolor A3(2), the a posteriori state probabilities revealed local maxima, or peaks, whose distribution was enriched in the intergenic sequences ("iPeaks" for intergenic peaks). Short DNA sequences underlying the iPeaks were extracted and clustered by a hierarchical classification algorithm based on Smith-Waterman local similarity. Selected motif consensus sequences were then used as box1 (-35 motif) to search for a potential neighbouring box2 (-10 motif) using a word enumeration algorithm. Applied to Streptomyces coelicolor, this new SFBS mining methodology retrieved already known SFBSs and suggested new potential transcriptional factor binding sites (TFBSs). The well-defined SigR regulon (oxidative stress response) was also used as a test quorum to compare first- and second-order HMMs. Our approach also allowed the preliminary detection of known SFBSs in Bacillus subtilis.
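The box1-spacer-box2 search can be illustrated with a minimal sketch. This is not the authors' word enumeration algorithm: the function name `enumerate_box2`, the exact-match search for box1, the 6-nucleotide box2 length, and the 16-19 nucleotide spacer range are all assumptions made here for illustration. The sketch only counts which words occur at an allowed distance downstream of a given box1.

```python
from collections import Counter


def enumerate_box2(sequences, box1, box2_len=6, spacer_range=(16, 19)):
    """Count candidate box2 words located downstream of exact box1 matches.

    All parameters are illustrative assumptions, not values from the paper.
    """
    counts = Counter()
    lo, hi = spacer_range
    for seq in sequences:
        start = seq.find(box1)
        while start != -1:
            box1_end = start + len(box1)
            # Try every allowed spacer length after this box1 occurrence.
            for spacer in range(lo, hi + 1):
                b2_start = box1_end + spacer
                word = seq[b2_start:b2_start + box2_len]
                if len(word) == box2_len:  # skip words truncated by the sequence end
                    counts[word] += 1
            start = seq.find(box1, start + 1)
    return counts


# Toy sequences (made up): the same box2 placed 17 and 18 nt after box1.
toy = [
    "TTGACA" + "C" * 17 + "TATAAT" + "CCC",
    "TTGACA" + "G" * 18 + "TATAAT" + "GGG",
]
counts = enumerate_box2(toy, "TTGACA")
best_word, best_count = counts.most_common(1)[0]
```

In this toy input, the word shared by both sequences at a variable spacer is the one with the highest count. A real search would also need approximate matching for box1 and a statistical assessment of over-representation, neither of which is attempted here.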
