Paper information - $LCSk$++: Practical similarity metric for long strings

$LCSk$++: Practical similarity metric for long strings

In this paper we present $LCSk$++, a new metric for measuring the similarity of long strings, and provide an algorithm for computing it efficiently. With the ever-increasing size of strings occurring in practice, e.g. the large genomes of plants and animals, classic algorithms such as Longest Common Subsequence (LCS) fail because of their demanding computational complexity. Recently, Benson et al. defined a similarity metric named $LCSk$. By relaxing the requirement that the $k$-length substrings should not overlap, we extend their definition into a new metric. We present an efficient algorithm that computes $LCSk$++ with complexity $O((|X|+|Y|)\log(|X|+|Y|))$ for strings $X$ and $Y$ under a realistic random model. The algorithm has been designed with implementation simplicity in mind. Additionally, we describe how it can be adjusted to compute $LCSk$ as well, which improves on the $O(|X|\cdot|Y|)$ algorithm presented in the original $LCSk$ paper.
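To make the two metrics concrete, here is a minimal quadratic-time reference sketch in Python. It is not the $O((|X|+|Y|)\log(|X|+|Y|))$ algorithm described in the abstract, only a plain dynamic program over all position pairs. It assumes the usual reading of the definitions: $LCSk$ counts the maximum number of ordered, non-overlapping common substrings of length exactly $k$, while $LCSk$++ is taken to be the maximum total length of ordered common substrings that are each at least $k$ long. The function names `lcsk`, `lcskpp`, and `_similarity` are hypothetical and chosen for illustration.

```python
from typing import List


def _similarity(x: str, y: str, k: int, plus_plus: bool) -> int:
    """Quadratic reference DP shared by both metrics (illustrative only)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n, m = len(x), len(y)
    # suf[i][j]: length of the longest common suffix of x[:i] and y[:j].
    suf: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    # dp[i][j]: best score using only the prefixes x[:i] and y[:j].
    dp: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                suf[i][j] = suf[i - 1][j - 1] + 1
            # Option 1: skip the last character of x or of y.
            best = max(dp[i - 1][j], dp[i][j - 1])
            if suf[i][j] >= k:
                if plus_plus:
                    # Assumed LCSk++ rule: end a matched run of any
                    # length >= k here and score it by its length.
                    for length in range(k, suf[i][j] + 1):
                        best = max(best, dp[i - length][j - length] + length)
                else:
                    # LCSk rule: end one k-length match here, worth 1.
                    best = max(best, dp[i - k][j - k] + 1)
            dp[i][j] = best
    return dp[n][m]


def lcsk(x: str, y: str, k: int) -> int:
    """Number of ordered, non-overlapping common k-length substrings."""
    return _similarity(x, y, k, plus_plus=False)


def lcskpp(x: str, y: str, k: int) -> int:
    """Total length of ordered common substrings of length >= k (assumed definition)."""
    return _similarity(x, y, k, plus_plus=True)
```

For example, with both strings equal to `"ABCDE"` and $k = 2$, `lcsk` returns 2 (the matches `AB` and `CD`), whereas `lcskpp` returns 5 because the whole string counts as a single run. With $k = 1$ both functions reduce to the classic LCS length.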

[1] Marcos A. Kiwi, et al. On a Speculated Relation Between Chvátal–Sankoff Constants of Several Sequences, 2008, Combinatorics, Probability and Computing.

[2] Ronald L. Rivest, et al. Introduction to Algorithms, third edition, 2009.

[3] Thomas G. Szymanski, et al. A fast algorithm for computing longest common subsequences, 1977, CACM.

[4] Szymon Grabowski, et al. Efficient algorithms for the longest common subsequence in k-length substrings, 2014, Inf. Process. Lett.

[5] Hao-Ren Ke, et al. Plagiarism Detection using ROUGE and WordNet, 2010, ArXiv.

[6] Jiří Matoušek, et al. Expected Length of the Longest Common Subsequence for Large Alphabets, 2003, LATIN.

[7] Peter M. Fenwick, et al. A new data structure for cumulative frequency tables, 1994, Softw. Pract. Exp.

[8] V. Chvátal, et al. Longest common subsequences of two random sequences, 1975, Advances in Applied Probability.

[9] Gary Benson, et al. Longest Common Subsequence in k Length Substrings, 2013, SISAP.

[10] Raffaele Giancarlo, et al. Sparse Dynamic Programming for Longest Common Subsequence from Fragments, 2002, J. Algorithms.

[11] R. Bundschuh. High precision simulations of the longest common subsequence problem, 2001, cond-mat/0106326.