Distribution of the length of the longest common subsequence of two multi-state biological sequences

The length of the longest common subsequence (LCS) between two biological sequences has been used as a measure of similarity, and this statistic is important in genomic studies. Even for the simple case of two sequences of equal length composed of binary elements with equal state probabilities, the exact distribution of the length of the LCS remains an open question. In computer science, the general LCS problem (for an arbitrary number of sequences) is known to be NP-hard. Rather than relying on combinatorial analysis, we use the finite Markov chain imbedding technique to derive the exact distribution of the length of the LCS between two multi-state sequences of different lengths. Numerical results are provided to illustrate the theoretical results.
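
To make the statistic concrete, the following minimal sketch computes the LCS length of two sequences with the standard dynamic-programming recurrence, and then obtains the exact distribution for very short sequences by brute-force enumeration. It is not the finite Markov chain imbedding method of the paper; it only illustrates what is being computed. The function names, and the assumption that elements are i.i.d. within and between sequences with state probabilities `probs`, are illustrative choices and not taken from the source.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (standard DP)."""
    prev = [0] * (len(y) + 1)  # DP row for the previous prefix of x
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sequence_probability(seq, probs):
    """Probability of seq under i.i.d. draws with state probabilities probs."""
    p = Fraction(1)
    for state in seq:
        p *= probs[state]
    return p


def lcs_length_distribution(m, n, probs):
    """Exact distribution of the LCS length by enumerating all sequence pairs.

    Assumption (illustrative): both sequences are i.i.d. over the states
    0..len(probs)-1 and independent of each other. The cost grows as
    len(probs)**(m + n), so this is feasible only for very small m and n.
    """
    states = range(len(probs))
    dist = Counter()
    for x in product(states, repeat=m):
        px = sequence_probability(x, probs)
        for y in product(states, repeat=n):
            dist[lcs_length(x, y)] += px * sequence_probability(y, probs)
    return dict(dist)


if __name__ == "__main__":
    half = Fraction(1, 2)
    # Two binary sequences of length 2 with equal state probabilities.
    print(lcs_length_distribution(2, 2, [half, half]))
```

Exact rational arithmetic (`Fraction`) is used so that the probabilities sum to exactly 1. The exponential cost of the enumeration is the reason a method such as the one described in the abstract is needed for sequences of realistic length.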
