Bayesian Clustering of Transcription Factor Binding Motifs

Genes are often regulated in living cells by proteins called transcription factors, which bind directly to short segments of DNA in close proximity to specific genes. These binding sites share a conserved nucleotide pattern, which is called a motif. Several recent studies of transcriptional regulation require reducing a large collection of motifs into clusters based on the similarity of their nucleotide composition. We present a principled approach to this clustering problem based on a Bayesian hierarchical model that accounts for both within- and between-motif variability. We use a Dirichlet process prior distribution that allows the number of clusters to vary, and we also present a novel generalization that allows the core width of each motif to vary. This clustering model is implemented, using a Gibbs sampling strategy, on several collections of transcription factor motif matrices. Our stochastic implementation allows us to examine the variability of our results in addition to focusing on a set of best clusters. Our results suggest that several transcription factor protein families are actually mixtures of smaller groups of highly similar motifs, which provide substantially more refined information than the full set of motifs in the family. Our clusters provide a means to organize transcription factors by binding motif similarity and can be used to reduce motif redundancy within large databases such as JASPAR and TRANSFAC, which aids the use of these databases for further motif discovery. Finally, our clustering procedure has been used in combination with discovery of evolutionarily conserved motifs to predict co-regulated genes. An alternative to our Dirichlet process prior distribution is presented that differs substantially in its a priori clustering characteristics but shows no substantive difference in the clustering results for our dataset.
Although our application is specific to transcription factor binding motifs, our Bayesian clustering model based on the Dirichlet process has several advantages over traditional clustering methods that could make the procedure appropriate and useful for many other clustering applications.
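
To make the idea of Gibbs sampling under a Dirichlet process prior concrete, the following is a minimal sketch and not the model described above. It assumes that each motif is a count matrix of fixed width (one row of A, C, G, T counts per position), that each cluster has a symmetric Dirichlet prior on its nucleotide frequencies at every position, and that cluster frequencies are integrated out (a collapsed sampler). The hierarchical within-motif variability and the variable core width are omitted. The names `gibbs_cluster`, `log_predictive`, `alpha`, and `beta` are illustrative choices, not taken from the paper.

```python
import math
import random


def log_beta_fn(v):
    """Log of the multivariate Beta function B(v)."""
    return sum(math.lgamma(a) for a in v) - math.lgamma(sum(v))


def log_predictive(motif, cluster_counts, beta):
    """Log Dirichlet-multinomial predictive of a motif given a cluster's counts.

    The multinomial coefficient is dropped: it is the same for every
    candidate cluster and cancels when the weights are normalized.
    """
    total = 0.0
    for x, c in zip(motif, cluster_counts):
        post = [ci + beta for ci in c]
        total += log_beta_fn([p + xi for p, xi in zip(post, x)]) - log_beta_fn(post)
    return total


def gibbs_cluster(motifs, alpha=1.0, beta=0.5, sweeps=50, seed=0):
    """Collapsed Gibbs sampler; returns the cluster labels after the last sweep."""
    rng = random.Random(seed)
    width = len(motifs[0])
    labels = list(range(len(motifs)))  # start with every motif in its own cluster
    counts = {k: [list(col) for col in m] for k, m in enumerate(motifs)}
    sizes = {k: 1 for k in labels}
    next_label = len(motifs)
    for _ in range(sweeps):
        for i, m in enumerate(motifs):
            # Remove motif i from its current cluster.
            k_old = labels[i]
            sizes[k_old] -= 1
            if sizes[k_old] == 0:
                del sizes[k_old], counts[k_old]
            else:
                for j in range(width):
                    for b in range(4):
                        counts[k_old][j][b] -= m[j][b]
            # Existing cluster k: weight n_k * predictive; new cluster: alpha * prior predictive.
            cands = list(sizes)
            logw = [math.log(sizes[k]) + log_predictive(m, counts[k], beta) for k in cands]
            empty = [[0] * 4 for _ in range(width)]
            cands.append(None)
            logw.append(math.log(alpha) + log_predictive(m, empty, beta))
            top = max(logw)
            weights = [math.exp(lw - top) for lw in logw]
            k_new = rng.choices(cands, weights=weights)[0]
            if k_new is None:  # open a new cluster
                k_new = next_label
                next_label += 1
                sizes[k_new] = 0
                counts[k_new] = [[0] * 4 for _ in range(width)]
            sizes[k_new] += 1
            for j in range(width):
                for b in range(4):
                    counts[k_new][j][b] += m[j][b]
            labels[i] = k_new
    return labels


# Toy input (made up for illustration): three A-rich and three T-rich motifs of width 4.
a_rich = [[[18, 1, 1, 0]] * 4, [[17, 2, 1, 0]] * 4, [[19, 0, 1, 0]] * 4]
t_rich = [[[0, 1, 1, 18]] * 4, [[0, 1, 2, 17]] * 4, [[1, 0, 0, 19]] * 4]
labels = gibbs_cluster(a_rich + t_rich)
print(labels)
```

Because the number of clusters is not fixed in advance, the sampler can open a new cluster at any step with weight proportional to `alpha`. Saving the labels from many sweeps, rather than only the last one, would give the kind of variability assessment mentioned in the abstract.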

[27]  Jun S. Liu,et al.  The Collapsed Gibbs Sampler in Bayesian Computations with Applications to a Gene Regulation Problem , 1994 .