Self-Index based on LZ77 (thesis)

Domains such as bioinformatics, version control systems, and collaborative editing systems (wikis) are producing huge data collections that are highly repetitive: the elements of a collection differ only slightly from one another. This makes such collections extremely compressible. For example, a collection containing all versions of a Wikipedia article can be compressed to as little as 0.1% of its original size using the Lempel-Ziv 1977 (LZ77) compression scheme. Many of these repetitive collections hold huge amounts of text. We therefore need a method that stores them efficiently while still allowing us to operate on them. The most common operations are extracting arbitrary portions of the collection and finding all occurrences of a given pattern in the whole collection. A self-index is a data structure that stores a text in compressed form and supports finding the occurrences of a pattern efficiently. Moreover, a self-index can extract any substring of the collection, so it can replace the original text. One of the main goals when using these indexes is to fit them in main memory. In this thesis we present a scheme for random text extraction from text compressed with a Lempel-Ziv parsing. Additionally, we present a variant of LZ77, called LZ-End, that extracts text efficiently using space close to that of LZ77. The main contribution of this thesis is the first self-index based on LZ77/LZ-End and oriented to repetitive texts, which outperforms the state of the art (the RLCSA self-index) in many aspects. Finally, we present a corpus of repetitive texts drawn from several application domains. We aim to provide a standard set of texts for research and experimentation, so the corpus is publicly available.
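To make the compressibility claim concrete, the following is a minimal sketch of a greedy LZ77-style parse in Python. It is not the thesis's implementation and does not cover LZ-End or the self-index. The function names `lz77_parse` and `lz77_decode`, the phrase layout `(source_start, copy_length, next_char)`, and the brute-force search are all assumptions made for illustration. Each phrase copies the longest earlier-occurring prefix of the remaining text and then appends one explicit character.

```python
def lz77_parse(text):
    """Greedy LZ77-style parse (illustrative, brute force).

    Returns a list of phrases (source_start, copy_length, next_char).
    """
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, 0
        # Try every earlier start position; the copy may overlap position i.
        for src in range(i):
            length = 0
            # Stop one character early so an explicit trailing char remains.
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_parse."""
    out = []
    for src, length, ch in phrases:
        # Character-by-character copying handles self-overlapping sources.
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # A toy "repetitive collection": the same sentence repeated many times.
    collection = "the quick brown fox " * 20
    phrases = lz77_parse(collection)
    assert lz77_decode(phrases) == collection
    print(len(collection), "characters ->", len(phrases), "phrases")
```

On repetitive input the number of phrases grows far more slowly than the text length, which is the property a Lempel-Ziv-based index exploits. Note that the decoder above must replay the parse from the beginning, so it does not offer the random extraction discussed in the abstract.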
