CIS: compound importance sampling method for protein-DNA binding site p-value estimation

MOTIVATION: A key aspect of transcriptional regulation is the binding of transcription factors to sequence-specific binding sites, which allows them to modulate the expression of nearby genes. Given models of such binding sites, one can scan regulatory regions for putative binding sites and construct a genome-wide regulatory network. In such genome-wide scans, it is crucial to control the number of false-positive predictions. Several recent works have demonstrated the benefits of modeling dependencies between positions within the binding site. Yet computing the statistical significance of putative binding sites under such models remains a challenge.

RESULTS: We present a general, accurate and efficient method for computing p-values of putative binding sites that is applicable to a large class of probabilistic binding site and background models. We demonstrate the accuracy of the method on synthetic and real-life data.

AVAILABILITY: The procedure for scanning DNA sequences and computing the statistical significance of putative binding site scores is available upon request at http://compbio.cs.huji.ac.il/CIS/

CONTACT: nir@cs.huji.ac.il
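To make the quantity being estimated concrete, the following is a minimal sketch of importance sampling for a binding site p-value, that is, the probability that a sequence drawn from the background model scores at least as high as a given threshold. It is not the CIS procedure itself. It assumes the simplest setting: a position-independent motif model, a uniform background, a log-odds score, and the motif itself as the sampling distribution. The matrix `MOTIF`, the threshold, and all function names are hypothetical and chosen only for illustration.

```python
import itertools
import math
import random

ALPHABET = ["A", "C", "G", "T"]
# Assumption: uniform, position-independent background model.
BACKGROUND = {a: 0.25 for a in ALPHABET}
# Hypothetical 6-position motif model (one distribution per position).
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1},
    {"A": 0.1, "C": 0.4, "G": 0.1, "T": 0.4},
]


def score(seq):
    """Log-odds score of a sequence: motif model versus background."""
    return sum(math.log(col[a] / BACKGROUND[a]) for col, a in zip(MOTIF, seq))


def background_prob(seq):
    """Probability of a sequence under the background model."""
    return math.prod(BACKGROUND[a] for a in seq)


def exact_p_value(threshold):
    """Brute-force reference: enumerate every sequence (feasible only for short motifs)."""
    return sum(
        background_prob(seq)
        for seq in itertools.product(ALPHABET, repeat=len(MOTIF))
        if score(seq) >= threshold
    )


def importance_sampling_p_value(threshold, n_samples, seed=0):
    """Estimate P_background(score >= threshold) by sampling from the motif model.

    Each sampled sequence that passes the threshold contributes the weight
    background(seq) / motif(seq), which corrects for sampling from the
    motif model instead of the background.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        seq = [
            rng.choices(ALPHABET, weights=[col[a] for a in ALPHABET])[0]
            for col in MOTIF
        ]
        if score(seq) >= threshold:
            proposal_prob = math.prod(col[a] for col, a in zip(MOTIF, seq))
            total += background_prob(seq) / proposal_prob
    return total / n_samples


if __name__ == "__main__":
    threshold = 3.0
    print("exact    :", exact_p_value(threshold))
    print("estimated:", importance_sampling_p_value(threshold, 20000))
```

Sampling from the motif model concentrates the draws on high-scoring sequences, which are rare under the background, so far fewer samples are needed than with naive sampling from the background. Exhaustive enumeration is included only as a check for this toy case; it grows exponentially with motif length, and the independence assumption used here does not hold for the dependency-aware models the abstract targets.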
