Binding Site Graphs: A New Graph Theoretical Framework for Prediction of Transcription Factor Binding Sites

Computational prediction of nucleotide binding specificity for transcription factors remains a fundamental and largely unsolved problem. Determining binding positions is a prerequisite for research in gene regulation, a major mechanism controlling phenotypic diversity. Furthermore, accurate determination of binding specificities from high-throughput data sources is necessary to realize the full potential of systems biology. Unfortunately, a recent independent evaluation showed that more than half of the predictions from the most widely used algorithms are false. We introduce a graph-theoretical framework to describe local sequence similarity as the pairwise distances between nucleotides in promoter sequences, and hypothesize that densely connected subgraphs are indicative of transcription factor binding sites. Using a well-established sampling algorithm coupled with simple clustering and scoring schemes, we identify sets of closely related nucleotides and test those for known TF binding activity. Using an independent benchmark, we find that our algorithm predicts yeast binding motifs considerably better than currently available techniques, and does so without manual curation. Importantly, we reduce the proportion of false-positive predictions in yeast to less than 30%. We also develop a framework to evaluate the statistical significance of our motif predictions. We show that our approach is robust to the choice of input promoters, and thus can be used to predict binding positions from noisy experimental data. We apply our method to identify binding sites using data from genome-scale ChIP–chip experiments. Results from these experiments are publicly available at http://cagt10.bu.edu/BSG. The graph-theoretical framework developed here may be useful when combining predictions from numerous computational and experimental measures. Finally, we discuss how our algorithm can be used to improve the sensitivity of computational predictions of transcription factor binding specificities.
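
The central idea, that binding sites appear as densely connected subgraphs of a sequence-similarity graph, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract describes a sampling algorithm with clustering and scoring, whereas the sketch below simply takes the highest-degree node and its neighbours. The node definition (fixed-length windows), the Hamming-distance edge rule, and all names and parameters (`similarity_graph`, `densest_neighbourhood`, `k`, `max_mismatch`, the toy `promoters`) are assumptions made for illustration only.

```python
from itertools import combinations


def windows(seqs, k):
    """Yield (sequence index, offset, k-mer) for every length-k window."""
    for i, s in enumerate(seqs):
        for j in range(len(s) - k + 1):
            yield i, j, s[j:j + k]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def similarity_graph(seqs, k, max_mismatch):
    """Adjacency sets linking similar windows from *different* sequences.

    Assumption: an edge means Hamming distance <= max_mismatch.
    """
    nodes = list(windows(seqs, k))
    adj = {(i, j): set() for i, j, _ in nodes}
    for (i1, j1, w1), (i2, j2, w2) in combinations(nodes, 2):
        if i1 != i2 and hamming(w1, w2) <= max_mismatch:
            adj[(i1, j1)].add((i2, j2))
            adj[(i2, j2)].add((i1, j1))
    return adj


def densest_neighbourhood(adj):
    """Crude stand-in for dense-subgraph search: best-connected node plus neighbours."""
    seed = max(adj, key=lambda n: len(adj[n]))
    return sorted({seed} | adj[seed])


# Hypothetical toy promoters sharing one planted site, TGACGTCA, at offset 4.
promoters = ["AAAATGACGTCAAAAA", "CCCCTGACGTCACCCC", "GGGGTGACGTCAGGGG"]
graph = similarity_graph(promoters, k=8, max_mismatch=0)
site = densest_neighbourhood(graph)  # list of (sequence index, offset) pairs
```

On real promoters the exact-match setting used here would be far too strict; allowing mismatches enlarges the graph quickly, which is one reason a sampling-based search such as the one mentioned in the abstract becomes attractive.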
