Repetition-Based Text Indexes

Repetition-based indexing is a new scheme for preprocessing a text to support fast pattern matching queries. The scheme provides a general framework for representing information about repetitions, i.e., multiple occurrences of the same string in the text, and for using the information in pattern matching. Well-known text indexes, such as suffix trees, suffix arrays, DAWGs and their variations, which we collectively call suffix indexes, can be seen as instances of the scheme. Based on the scheme, we introduce the Lempel–Ziv index, a new text index for string matching. It uses the repetition information in a Lempel–Ziv parse, which is a division of the text into non-overlapping substrings with earlier occurrences, and which is also used in the Ziv–Lempel family of text compression methods. The Lempel–Ziv index offers a possibility for a space–time tradeoff. The space requirement can be smaller than for suffix indexes by up to a logarithmic factor, while the query time is larger but still sublinear in the length of the text. The only previous text index offering a space–time tradeoff is the sparse suffix tree. The Lempel–Ziv index improves on the results of the sparse suffix tree in many cases. Text indexes for q-gram matching, i.e., for matching string patterns of length q, are used in some approximate string matching algorithms. We introduce a new repetition-based q-gram index, the Lempel–Ziv index for q-grams, that has asymptotically optimal space requirement and query time provided that q is a constant or grows slowly enough with respect to the length of the text. Queries are as fast as with traditional q-gram indexes, but the space requirement can be smaller by a logarithmic factor.
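
To make the notion of a parse concrete, the following Python sketch divides a text into non-overlapping phrases, each with an earlier occurrence. It is only an illustration under stated assumptions and not necessarily the exact parse variant used for the index: the function name `lz_parse` is hypothetical, phrases are chosen greedily (longest match first), the earlier occurrence is required to lie entirely inside the already-parsed prefix, and a character that has not been seen before becomes a one-character phrase with no source. The quadratic-time search is chosen for readability, not efficiency.

```python
def lz_parse(text):
    """Greedy Lempel-Ziv-style parse (illustrative sketch).

    Returns a list of (start, length, source) triples. `source` is the
    start of an earlier occurrence of the phrase that lies entirely
    within text[:start], or None for a first-time character.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        length, source = 0, None
        # Extend the phrase while it still occurs inside the parsed prefix.
        while i + length < n:
            pos = text.find(text[i:i + length + 1], 0, i)
            if pos == -1:
                break
            length += 1
            source = pos
        if length == 0:
            # Assumption: a new character forms a phrase of its own.
            phrases.append((i, 1, None))
            i += 1
        else:
            phrases.append((i, length, source))
            i += length
    return phrases


parse = lz_parse("abababb")
```

For the text `abababb`, this sketch produces five phrases: `a`, `b`, `ab`, `ab`, `b`. The last three each point back to an earlier copy in the text, which is the kind of repetition information that a repetition-based index can exploit.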