Paper information - Approximate word matches between two random sequences

Approximate word matches between two random sequences

Given two sequences over a finite alphabet $\mathcal{L}$, the $D_2$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the $D_2$ statistic in the context of DNA sequences, under the assumption of strand-symmetric Bernoulli text. For $k<m$, we consider the count of $m$-letter word matches with up to $k$ mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance, and prove that its distribution is asymptotically normal.
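As an illustration of the statistic described above, the following Python sketch counts the $m$-letter word pairs that differ in at most $k$ positions, and evaluates the expectation that follows directly from the Bernoulli (i.i.d. letters) assumption: each aligned letter pair matches with probability $p_2 = \sum_a p_a^2$, so the number of mismatches in a word pair is binomial. The function names, parameter names, and the brute-force counting approach are assumptions made for this example and are not taken from the paper.

```python
from math import comb


def approx_word_matches(seq_a, seq_b, m, k):
    """Count pairs (i, j) whose m-letter words differ in at most k positions.

    Brute-force sketch: compares every word of seq_a with every word of seq_b.
    """
    count = 0
    for i in range(len(seq_a) - m + 1):
        word_a = seq_a[i:i + m]
        for j in range(len(seq_b) - m + 1):
            word_b = seq_b[j:j + m]
            # Hamming distance between the two m-letter words
            mismatches = sum(x != y for x, y in zip(word_a, word_b))
            if mismatches <= k:
                count += 1
    return count


def expected_approx_word_matches(len_a, len_b, m, k, letter_probs):
    """Expected count for two independent sequences of i.i.d. letters.

    letter_probs maps each letter to its probability (hypothetical input
    format). Strand symmetry would mean p_A == p_T and p_C == p_G.
    """
    # Probability that two independent letters coincide
    p2 = sum(p * p for p in letter_probs.values())
    # P(Binomial(m, 1 - p2) <= k): at most k mismatches among m letter pairs
    tail = sum(comb(m, t) * p2 ** (m - t) * (1 - p2) ** t for t in range(k + 1))
    # Number of word positions in each sequence, multiplied together
    return (len_a - m + 1) * (len_b - m + 1) * tail
```

Setting `k = 0` reduces the count to the ordinary $D_2$ statistic. For example, `approx_word_matches("AAAA", "AAAT", 2, 0)` counts only exact 2-letter matches, while `k = 1` also admits the word pairs involving `AT`.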

Susan R. Wilson | Conrad J. Burden | Miriam R. Kantorovitz
