Approximate String Matching with Lempel-Ziv Compressed Indexes

A compressed full-text self-index for a text T is a data structure that requires reduced space and is capable of searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, so it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest in, for example, computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Lempel-Ziv index. We show experimentally that our algorithm has competitive performance and provides a useful space-time tradeoff compared to classical indexes.
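
To make the ASM problem concrete, the following is a minimal sketch of the classical online (non-indexed) dynamic-programming search, not the indexed algorithm of the paper. It assumes the error model is edit distance (insertions, deletions, substitutions), which the abstract does not state explicitly; the function name `approx_search` and its parameters are hypothetical.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the 0-based end positions in `text` where some substring
    ending there is within edit distance `k` of `pattern`.

    Assumption: errors are counted as Levenshtein edit operations.
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring
    # of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra character in the text
                col[i - 1] + 1,    # missing character in the text
            )
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_search("abc", "xxabcxx", 0))  # [4]
    print(approx_search("abc", "abd", 1))      # [1, 2]
```

This scan takes time proportional to |P|·|T| and reads the whole text. An index for ASM aims to avoid that full scan.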
