Paper information - Large deviation properties for patterns

Large deviation properties for patterns

Deciding whether a given pattern is over- or under-represented with respect to a given background model is a key question in computational biology. Such a decision is usually made by computing p-values that reflect the "exceptionality" of a pattern in a given sequence or set of sequences. In the simplest cases (short and simple patterns, a simple background model, a small number of sequences), an exact p-value can be computed with tractable complexity. Realistic cases are generally too complicated for an exact p-value to be computed, so approximations (Gaussian, Poisson, large deviation) are used instead. Each approximation applies only under certain conditions: Gaussian approximations are valid in the central domain, while Poisson and large deviation approximations are valid for rare events. In the present paper, we prove a large deviation approximation for the double-strand counting problem, that is, counting the occurrences of a given pattern in a set of sequences that arise from both strands of the genome. In that case, dependencies between a sequence and its reverse complement cannot be neglected. They are captured here for a Bernoulli model using general combinatorial properties of the pattern. A large deviation result is also provided for a set of short sequences.
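
The contrast between the approximation regimes mentioned in the abstract can be illustrated with a small sketch. This is not the paper's method: it makes the strong simplifying assumption that the number of occurrences of a word among n positions is Binomial(n, p), where p is the word's probability under a Bernoulli (i.i.d. letters) model, so it ignores overlaps, the dependence between neighbouring positions, and the reverse-complement strand altogether. All function names are hypothetical. It compares the exact binomial tail with a Gaussian approximation, a Poisson approximation, and the Chernoff-type large deviation bound exp(-n I(k/n)), where I is the relative entropy between Bernoulli(k/n) and Bernoulli(p).

```python
import math


def word_probability(word, letter_probs):
    """Probability of `word` at a fixed position under an i.i.d. letter model."""
    p = 1.0
    for ch in word:
        p *= letter_probs[ch]
    return p


def binomial_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p), summed in log space (0 < p < 1)."""
    total = 0.0
    for j in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                   + j * math.log(p) + (n - j) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total


def gaussian_tail(n, p, k):
    """Normal approximation of P(X >= k), with a continuity correction."""
    mean = n * p
    std = math.sqrt(n * p * (1.0 - p))
    return 0.5 * math.erfc((k - 0.5 - mean) / (std * math.sqrt(2.0)))


def poisson_tail(lam, k, max_terms=100000):
    """P(Y >= k) for Y ~ Poisson(lam), summing terms until they are negligible."""
    total = 0.0
    for j in range(k, k + max_terms):
        term = math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
        total += term
        # Past the mode the terms only decrease, so it is safe to stop here.
        if j > lam and term <= 1e-17 * total:
            break
    return total


def large_deviation_bound(n, p, k):
    """Chernoff bound exp(-n * I(k/n)) on P(X >= k), valid when k/n > p."""
    a = k / n
    if a <= p:
        raise ValueError("the bound is only meaningful for k/n > p")
    rate = a * math.log(a / p)
    if a < 1.0:
        rate += (1.0 - a) * math.log((1.0 - a) / (1.0 - p))
    return math.exp(-n * rate)


if __name__ == "__main__":
    # Illustrative values only: uniform letters, 10,000 positions, 80 occurrences.
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    p = word_probability("ACGT", uniform)
    n, k = 10_000, 80
    print("exact binomial tail  :", binomial_tail(n, p, k))
    print("Gaussian approx      :", gaussian_tail(n, p, k))
    print("Poisson approx       :", poisson_tail(n * p, k))
    print("large deviation bound:", large_deviation_bound(n, p, k))
```

In this toy setting the large deviation quantity is an upper bound on the exact tail, and it captures the exponential decay rate of the tail rather than its exact value. The double-strand result described in the abstract concerns a harder setting, where occurrences on a sequence and on its reverse complement are dependent.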

M. Régnier | Jérémie Bourdon
