An improved compound Poisson model for the number of motif hits in DNA sequences

Motivation: Transcription factors play a crucial role in gene regulation by binding to specific regulatory sequences. The sequence motifs recognized by a transcription factor can be described in terms of position frequency matrices. When scanning a sequence for matches to a position frequency matrix, one needs to choose a score cut‐off, which in turn determines the number of hits. In this paper we describe how to compute the distribution of match scores and the distribution of the number of motif hits, both of which are prerequisites for motif hit enrichment analysis.

Results: We put forward an improved compound Poisson model that supports general order‐d Markov background models and computes the distribution of the number of motif hits more accurately than earlier models. We compared the accuracy of the improved compound Poisson model with that of previously proposed models across a range of parameters and motifs, which demonstrates the improvement. The importance of the order‐d model is supported by a case study using CpG‐island sequences.

Availability and implementation: The method is available as a Bioconductor package named ‘motifcounter’ at https://bioconductor.org/packages/motifcounter.

Contact: kopp@molgen.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.
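As a rough illustration of the scanning step described under Motivation, the sketch below scores every window of a sequence against a position frequency matrix and counts the windows that reach a cut‐off. It is not the motifcounter API and not the compound Poisson model itself. The function names, the toy matrix, the uniform order‐0 background (the paper supports order‐d Markov backgrounds) and the forward‐strand‐only scan are all assumptions made for brevity.

```python
import math

ALPHABET = "ACGT"

def log_odds_matrix(pfm, background):
    """Turn a PFM into per-position log-odds scores.

    Assumption: an order-0 background, i.e. one fixed probability per base.
    """
    return [
        {base: math.log(row[base] / background[base]) for base in ALPHABET}
        for row in pfm
    ]

def scan(sequence, scores, cutoff):
    """Return the start positions whose window score is >= cutoff.

    Only the forward strand is scanned in this sketch.
    """
    width = len(scores)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(scores[i][base] for i, base in enumerate(window))
        if score >= cutoff:
            hits.append(start)
    return hits

# Hypothetical 3-bp motif that prefers "ACG"; the values are illustrative only.
pfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {base: 0.25 for base in ALPHABET}
scores = log_odds_matrix(pfm, background)

sequence = "TTACGACTTT"
hits_strict = scan(sequence, scores, cutoff=3.0)  # only the exact "ACG" window
hits_loose = scan(sequence, scores, cutoff=1.0)   # also admits the "ACT" window
print(len(hits_strict), len(hits_loose))
```

Lowering the cut‐off admits weaker matches, so the hit count depends directly on that choice. The distribution of this count under a background model is what the paper's compound Poisson approximation addresses.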
