Paper Information - Matching for Run-Length Encoded Strings

1 Motivation

Measuring the similarity between two strings, through such standard measures as Hamming distance, edit distance, and longest common subsequence, is one of the fundamental problems in pattern matching. We consider the problem of finding the longest common subsequence of two strings. A well-known dynamic programming algorithm computes the longest common subsequence of strings X and Y in O(|X| |Y|) time. In this paper, we develop significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems.

A string S is run-length encoded if it is described as an ordered sequence of pairs (σ, i), each consisting of an alphabet symbol σ and an integer i. Each pair corresponds to a run in S consisting of i consecutive occurrences of σ. For example, the string aaaabbbbcccabbbbcc can be encoded as a^4 b^4 c^3 a^1 b^4 c^2. Such a run-length encoded string can be significantly shorter than the expanded string representation. Indeed, run-length coding serves as a popular image compression technique, since many classes of images, such as binary images in facsimile transmission, typically contain large patches of identically-valued pixels.

The need to approximately match run-length encoded strings emerged during development of an optical character recognition (OCR) system. This system, built in association with Data Capture Systems Inc. [8], has been designed to achieve a low substitution error-rate via fixed-font character recognition. The ith row or column of pixels in a given query character image will define a binary string containing a small number of white-black transitions. By comparing this run-length encoded string against the ith row or column of each of the character image-models, we can identify
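
To make the definitions above concrete, the following is a minimal Python sketch of run-length encoding together with the well-known O(|X| |Y|) dynamic program that the text uses as its baseline. It is not the faster algorithm developed in the paper. The function names `rle_encode`, `rle_decode`, and `lcs_length` are illustrative assumptions and do not come from the source.

```python
from itertools import groupby


def rle_encode(s):
    """Return the run-length encoding of s as a list of (symbol, run_length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]


def rle_decode(pairs):
    """Expand a list of (symbol, run_length) pairs back into a plain string."""
    return "".join(symbol * count for symbol, count in pairs)


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y.

    Classic dynamic program over the expanded strings: O(|X| |Y|) time,
    keeping only two rows of the table at once.
    """
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Matching symbols extend the best subsequence of both prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop the last symbol of x or of y, whichever is better.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    s = "aaaabbbbcccabbbbcc"
    runs = rle_encode(s)
    print(runs)
    print(rle_decode(runs) == s)
    print(lcs_length("aaaabbbb", "aabbbbcc"))
```

For the example string in the text, the 18 expanded symbols collapse into 6 runs. The baseline dynamic program still works on the expanded strings, so its cost grows with the expanded lengths rather than with the number of runs.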

Gad M. Landau | Steven Skiena | Alberto Apostolico

[1] János Csirik, et al. An Improved Algorithm for Computing the Edit Distance of Run-Length Coded Strings, 1995, Inf. Process. Lett.

[2] Alberto Apostolico, et al. String Editing and Longest Common Subsequences, 1997, Handbook of Formal Languages.

[3] 田村智幸. Bounds on the Complexity of the Longest Common Subsequence Problem, 1976.

[4] Steven Skiena, et al. Geometric decision trees for optical character recognition (extended abstract), 1997, SCG '97.

[5] Daniel S. Hirschberg, et al. A linear space algorithm for computing maximal common subsequences, 1975, Commun. ACM.

[6] Alfred V. Aho, et al. Bounds on the Complexity of the Longest Common Subsequence Problem, 1976, J. ACM.

[7] Gen-Huey Chen, et al. On the Set LCS and Set-Set LCS Problems, 1993, J. Algorithms.

[8] Daniel S. Hirschberg, et al. An Information-Theoretic Lower Bound for the Longest Common Subsequence Problem, 1977, Inf. Process. Lett.

[9] Mike Paterson, et al. A Faster Algorithm Computing String Edit Distances, 1980, J. Comput. Syst. Sci.

[10] Grzegorz Rozenberg, et al. Handbook of Formal Languages, 1997, Springer Berlin Heidelberg.