Paper Information - Exact Distribution of a Spaced Seed Statistic for DNA Homology Detection


Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an iid Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C_nSp, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
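
To make the definition of covered 1s concrete, the following is a minimal Python sketch. It is not the efficient method the paper presents: it counts covered positions directly from the definition and, for small n only, obtains the exact distribution by enumerating all 2^n binary strings. The function names `covered_count` and `brute_force_distribution` are hypothetical and chosen only for illustration.

```python
from itertools import product


def covered_count(seed: str, text: str) -> int:
    """Count the 1s in `text` covered by at least one occurrence of `seed`.

    `seed` is a string over {'1', '*'}; `text` is a string over {'0', '1'}.
    """
    k = len(seed)
    # Offsets within the seed that must align with a 1 in the text.
    ones = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for start in range(len(text) - k + 1):
        if all(text[start + i] == "1" for i in ones):
            # Only text positions under a seed '1' are covered;
            # positions under '*' are not, even if they hold a 1.
            covered.update(start + i for i in ones)
    return len(covered)


def brute_force_distribution(seed: str, n: int, p: float) -> list:
    """Exact distribution of the covered count by full enumeration.

    Exponential in n, so usable only for small n (assumption: this is a
    naive check, not the paper's algorithm).
    """
    dist = [0.0] * (n + 1)
    for bits in product("01", repeat=n):
        text = "".join(bits)
        m = text.count("1")
        # Probability of this string under iid Bernoulli(p).
        dist[covered_count(seed, text)] += p**m * (1 - p) ** (n - m)
    return dist
```

For S = 11*1 and T = 1111, the seed occurs once and covers three positions; the 1 under the `*` is not covered. With T = 11111 the two overlapping occurrences together cover all five positions, which is why the statistic counts distinct covered positions rather than hits.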

Gary Benson | Denise Y. F. Mak
