A Statistical Method for Finding Transcription Factor Binding Sites

Understanding the mechanisms that regulate gene expression is an important and challenging problem. A fundamental subproblem is to identify DNA-binding sites for unknown regulatory factors, given a collection of genes believed to be coregulated and the noncoding DNA sequences near those genes. We present an enumerative statistical method for identifying good candidates for such transcription factor binding sites. Unlike local search techniques such as expectation maximization and Gibbs sampling, which may not reach a global optimum, the method proposed here is guaranteed to produce the motifs with the greatest z-scores. We discuss the results of experiments in which this algorithm was used to locate candidate binding sites in several well-studied pathways of S. cerevisiae, as well as in gene clusters from some hybridization microarray experiments.
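
To make the enumerative idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes exact, ungapped k-mers, a background model of independent bases whose frequencies are estimated from the input sequences themselves, and a z-score on the number of sequences containing a k-mer that ignores motif self-overlap. The abstract does not specify the motif class, background model, or variance computation actually used, and the function names `zscores` and `top_motifs` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def zscores(seqs, k):
    """Return a z-score for every DNA k-mer, measuring over-representation.

    Assumptions (illustrative only): independent-bases background estimated
    from `seqs`; occurrences at different positions treated as independent.
    """
    # Background base frequencies, estimated from the input sequences.
    base_counts = Counter(ch for s in seqs for ch in s)
    total = sum(base_counts.values())
    freq = {b: base_counts[b] / total for b in "ACGT"}

    # Observed statistic: number of sequences containing the k-mer at least once.
    observed = Counter()
    for s in seqs:
        observed.update({s[i:i + k] for i in range(len(s) - k + 1)})

    scores = {}
    # Exhaustive enumeration of all 4**k candidates, so the best-scoring
    # k-mer under this model cannot be missed.
    for kmer in map("".join, product("ACGT", repeat=k)):
        q = 1.0  # probability of the k-mer at one fixed position
        for b in kmer:
            q *= freq[b]
        # Approximate probability that each sequence contains the k-mer.
        probs = [1.0 - (1.0 - q) ** max(len(s) - k + 1, 0) for s in seqs]
        mean = sum(probs)
        var = sum(p * (1.0 - p) for p in probs)
        if var > 0:
            scores[kmer] = (observed[kmer] - mean) / sqrt(var)
    return scores


def top_motifs(seqs, k, n_top=5):
    """Return the n_top k-mers with the greatest z-scores."""
    scores = zscores(seqs, k)
    return sorted(scores, key=scores.get, reverse=True)[:n_top]
```

Because every candidate is scored, the ranking is exact with respect to the chosen statistic, in contrast to a local search that may stop at a local optimum. The cost grows as 4^k, so a sketch like this is only practical for short motifs.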
