Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources

An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict transcription factor (TF) binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and thus naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but it can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood method and a Bayesian method, the latter implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome, and the results indicate sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an online web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
The test data set, web tool, source code and supplementary data are available at http://www.probtf.org.
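
To make the idea of a promoter-level binding probability concrete, the following is a minimal sketch and not the ProbTF implementation. It assumes a much simpler model than the one summarized above: the promoter contains either no site or exactly one site, the motif is a single position weight matrix, the background is a zeroth-order nucleotide distribution, and additional evidence (for example conservation) enters only as non-negative weights over candidate site positions. The function name `binding_posterior` and all parameter names are hypothetical.

```python
def binding_posterior(seq, pwm, background, prior_bind=0.5, position_weights=None):
    """Posterior probability that `seq` contains one binding site (toy model).

    seq              -- promoter sequence over the alphabet A, C, G, T
    pwm              -- list of dicts, one per motif column, mapping base -> probability
    background       -- dict mapping base -> background probability (zeroth order)
    prior_bind       -- assumed prior probability that the promoter is bound
    position_weights -- optional non-negative weights over site start positions,
                        standing in for extra evidence such as conservation
    """
    width = len(pwm)
    n_pos = len(seq) - width + 1
    if n_pos <= 0:
        raise ValueError("sequence is shorter than the motif")
    if position_weights is None:
        position_weights = [1.0] * n_pos  # uniform prior over site positions
    if len(position_weights) != n_pos:
        raise ValueError("need one weight per candidate start position")
    total = float(sum(position_weights))
    if total <= 0:
        raise ValueError("position weights must have a positive sum")

    # Likelihood ratio P(seq | one site) / P(seq | no site). Background terms
    # outside the site cancel, so only the motif window contributes.
    likelihood_ratio = 0.0
    for start in range(n_pos):
        ratio = 1.0
        for offset in range(width):
            base = seq[start + offset]
            ratio *= pwm[offset][base] / background[base]
        likelihood_ratio += (position_weights[start] / total) * ratio

    # Bayes' rule for the two hypotheses "bound" and "not bound".
    bound = prior_bind * likelihood_ratio
    return bound / (bound + (1.0 - prior_bind))


# Illustrative inputs only: a width-2 motif that strongly prefers "AC".
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pwm = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
]
p_uniform = binding_posterior("ACGT", pwm, background)
p_weighted = binding_posterior("ACGT", pwm, background, position_weights=[1.0, 0.0, 0.0])
```

In this toy setting, concentrating the position weights on the window that matches the motif raises the posterior relative to uniform weights, which is one simple way that extra evidence can sharpen a motif-only prediction. The output is a probability rather than a thresholded score, so it could in principle be passed on to a downstream probabilistic model.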
