Lempel-Ziv parsing and sublinear-size index structures for string matching

String matching over a long text can be significantly sped up with an index structure built by preprocessing the text. For very long texts, the size of such an index can be a problem. This paper presents the first sublinear-size index structure. The new structure is based on Lempel-Ziv parsing of the text and has size linear in $N$, the size of the Lempel-Ziv parse. For a text of length $n$, $N = O(n / \log n)$, and $N$ can be still smaller if the text is compressible. With the new index structure, all occurrences of a pattern string of length $m$ can be found in time $O(m^2 + (m + L) \log N + \sqrt{Nm \log m})$, where $L$ is the number of occurrences found.
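To make the quantity N concrete, the following is a minimal sketch of a Lempel-Ziv parse, assuming the LZ78 variant in which each phrase is the longest previously seen phrase extended by one character. The abstract does not say which Lempel-Ziv variant the index uses, so this is an illustration of how the phrase count relates to the text length, not the paper's construction. The function name `lz78_parse` is hypothetical.

```python
def lz78_parse(text):
    """Split text into LZ78-style phrases (illustrative assumption, not
    necessarily the parse variant used by the paper)."""
    # trie maps (parent phrase id, next char) -> phrase id; id 0 is the empty phrase
    trie = {}
    phrases = []
    node = 0      # current position in the trie
    start = 0     # start index of the phrase being built
    for i, ch in enumerate(text):
        key = (node, ch)
        if key in trie:
            # Still inside a known phrase: keep extending it.
            node = trie[key]
        else:
            # Longest known phrase plus one new character forms a new phrase.
            trie[key] = len(phrases) + 1
            phrases.append(text[start:i + 1])
            node = 0
            start = i + 1
    if start < len(text):
        # The text ended inside a known phrase; emit the remainder as a final phrase.
        phrases.append(text[start:])
    return phrases


if __name__ == "__main__":
    for sample in ["abababab", "a" * 100]:
        parse = lz78_parse(sample)
        print(len(sample), len(parse))  # text length n and phrase count N
```

On a highly repetitive text such as a long run of a single character, the phrase count grows far more slowly than the text length, which is the sense in which N can be much smaller than n for compressible texts.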
