Compound Poisson Approximation of the Number of Occurrences of a Position Frequency Matrix (PFM) on Both Strands

Transcription factors play a key role in gene regulation by interacting with specific binding sites, or motifs. Enrichment of binding motifs is therefore important for genome annotation, and efficient computation of its statistical significance (the p-value) is crucial. We propose an efficient approximation for computing this significance. By incorporating both strands of the DNA molecule and explicitly modeling the dependencies between overlapping hits, we achieve accurate results for any DNA motif given its Position Frequency Matrix (PFM) representation. We demonstrate the accuracy of the p-value approximation by comparing it with the simulated count distribution. Furthermore, we compare our approach with a binomial approximation, a (compound) Poisson approximation, and a normal approximation. In general, our approach outperforms these approximations, or is equally accurate but significantly faster. An implementation of our approach is available at http://mosta.molgen.mpg.de.
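
To illustrate what a compound Poisson p-value looks like in practice, the following is a minimal sketch and not the implementation described above. It assumes the generic clump formulation, in which the number of clumps of overlapping hits is Poisson with rate `lam` and clump sizes are independent draws from `clump_pmf`, and it evaluates the count distribution with the standard Panjer recursion. The function names and the clump-size values are hypothetical; deriving these parameters from a PFM on both strands is the subject of the paper and is not attempted here.

```python
import math


def compound_poisson_pmf(lam, clump_pmf, n_max):
    """Return [P(N = 0), ..., P(N = n_max)] for a compound Poisson count N.

    lam       -- assumed Poisson rate of clumps (hypothetical input)
    clump_pmf -- clump_pmf[k - 1] = P(clump size = k) for k >= 1
    """
    pmf = [0.0] * (n_max + 1)
    pmf[0] = math.exp(-lam)  # no clumps means no hits
    for n in range(1, n_max + 1):
        total = 0.0
        # Panjer recursion: condition on the size k of one clump.
        for k in range(1, min(n, len(clump_pmf)) + 1):
            total += k * clump_pmf[k - 1] * pmf[n - k]
        pmf[n] = lam / n * total
    return pmf


def p_value(lam, clump_pmf, observed):
    """Upper-tail probability P(N >= observed)."""
    if observed <= 0:
        return 1.0
    below = compound_poisson_pmf(lam, clump_pmf, observed - 1)
    return max(0.0, 1.0 - sum(below))


if __name__ == "__main__":
    # Hypothetical parameters: 1.5 expected clumps, 80% single hits,
    # 20% clumps of two overlapping hits.
    print(p_value(1.5, [0.8, 0.2], observed=5))
```

When every clump has size one (`clump_pmf = [1.0]`), the recursion reduces to the ordinary Poisson distribution, which gives a quick sanity check on the sketch.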
