Large deviation properties for patterns

Deciding whether a given pattern is over- or under-represented with respect to a given background model is a key question in computational biology. The decision is usually made by computing a p-value that reflects the "exceptionality" of the pattern in a given sequence or set of sequences. In the simplest cases (short and simple patterns, a simple background model, a small number of sequences), an exact p-value can be computed at tractable cost. Realistic cases are generally too complicated for an exact p-value to be computed, so approximations are used instead: Gaussian, Poisson, and large deviation approximations. Each is applicable only under certain conditions: Gaussian approximations are valid in the central domain, while Poisson and large deviation approximations are valid for rare events. In the present paper, we prove a large deviation approximation for the double-strand counting problem, that is, counting the occurrences of a given pattern in a set of sequences that arise from both strands of the genome. In that case, the dependencies between a sequence and its reverse complement cannot be neglected. They are captured here, for a Bernoulli model, from general combinatorial properties of the pattern. A large deviation result is also provided for a set of short sequences.
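To make the double-strand counting problem concrete, the following minimal Python sketch counts the overlapping occurrences of a word on both strands by searching one strand for the word and for its reverse complement. It is only an illustration, not the method of the paper: the function names are hypothetical, the alphabet is assumed to be uppercase A, C, G, T, and a word equal to its own reverse complement is (by assumption) counted once per position rather than twice.

```python
# Illustrative sketch only; names and conventions are assumptions, not the paper's.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of word in seq, allowing overlaps."""
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        # Restart one position further so overlapping matches are found.
        start = seq.find(word, start + 1)
    return count


def double_strand_count(seq: str, word: str) -> int:
    """Count word on both strands by scanning seq for word and its reverse complement."""
    rc = reverse_complement(word)
    forward = count_overlapping(seq, word)
    if rc == word:
        # Assumption: a self-reverse-complementary word is counted once per position.
        return forward
    return forward + count_overlapping(seq, rc)
```

An occurrence of the reverse complement on the given strand corresponds to an occurrence of the word on the opposite strand, which is why the two counts are statistically dependent and cannot be treated as counts in two independent sequences.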
