Combining phylogenetic motif discovery and motif clustering to predict co-regulated genes

MOTIVATION: We present PHYLOCLUS, a sequence-based framework and algorithm for predicting co-regulated genes. In our approach, de novo discovery methods are used to find evolutionarily conserved motifs, and a Bayesian hierarchical clustering model is then used to cluster these motifs, thereby grouping together genes that are putatively co-regulated. Our clustering procedure allows both the number of clusters and the motif width within each cluster to be unknown.

RESULTS: We use our framework to predict co-regulated genes in the bacterium Bacillus subtilis using six other closely related bacterial species. Our predicted motifs and gene clusters are validated against several external sources, and significant clusters are examined in detail. An extension to the discovery and clustering of two-block motifs can be used for inference about synergistic binding relationships between transcription factors.

AVAILABILITY: Software and Supplementary Materials can be downloaded at http://stat.wharton.upenn.edu/~stjensen/research/phyloclus.html or http://www.fas.harvard.edu/~junliu/phyloclus.html

CONTACT: stjensen@wharton.upenn.edu
