Efficient Compressed Wavelet Trees over Large Alphabets

The {\em wavelet tree} is a flexible data structure that represents a sequence $S[1,n]$ of symbols over an alphabet of size $\sigma$ within compressed space while supporting a wide range of operations on $S$. When $\sigma$ is significant compared to $n$, current wavelet tree representations incur noticeable space or time overheads. In this article we introduce the {\em wavelet matrix}, an alternative representation for large alphabets that retains all the properties of wavelet trees but is significantly faster. We also show how the wavelet matrix can be compressed up to the zero-order entropy of the sequence without sacrificing, and in fact while improving, its time performance. Our experimental results show that the wavelet matrix outperforms all the wavelet tree variants across the space/time tradeoff map.
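To make the idea concrete, the following is a minimal, uncompressed sketch of a wavelet matrix supporting two of the operations on $S$: access and rank. It assumes symbols are integers in $[0,\sigma)$ and uses 0-based positions. The class name `WaveletMatrix`, the helper `_map`, and the use of plain prefix-sum lists for binary rank are illustrative choices, not the article's implementation, which relies on succinct (and optionally compressed) bitvectors.

```python
class WaveletMatrix:
    """Illustrative wavelet matrix over integers in [0, sigma).

    Assumption: plain Python lists of prefix sums stand in for the
    succinct rank-capable bitvectors a real implementation would use.
    """

    def __init__(self, seq, sigma):
        self.n = len(seq)
        # One level per bit of the largest symbol (at least one level).
        self.levels = max(1, (sigma - 1).bit_length())
        self.ranks = []  # ranks[l][i] = number of 1-bits in level l, positions [0, i)
        self.zeros = []  # zeros[l]    = total number of 0-bits at level l
        cur = list(seq)
        for l in range(self.levels):
            shift = self.levels - 1 - l  # most significant bit first
            bits = [(x >> shift) & 1 for x in cur]
            prefix = [0]
            for b in bits:
                prefix.append(prefix[-1] + b)
            self.ranks.append(prefix)
            self.zeros.append(self.n - prefix[-1])
            # Stable partition: symbols with bit 0 go first, then those with bit 1.
            cur = ([x for x, b in zip(cur, bits) if b == 0]
                   + [x for x, b in zip(cur, bits) if b == 1])

    def _map(self, l, i, bit):
        """Map position i at level l to its position at level l + 1."""
        ones = self.ranks[l][i]
        return self.zeros[l] + ones if bit else i - ones

    def access(self, i):
        """Return S[i]."""
        c = 0
        for l in range(self.levels):
            bit = self.ranks[l][i + 1] - self.ranks[l][i]
            c = (c << 1) | bit
            i = self._map(l, i, bit)
        return c

    def rank(self, c, i):
        """Return the number of occurrences of c in S[0:i]."""
        p = 0
        for l in range(self.levels):
            bit = (c >> (self.levels - 1 - l)) & 1
            p = self._map(l, p, bit)
            i = self._map(l, i, bit)
        return i - p


seq = [3, 1, 4, 1, 5, 9, 2, 6]
wm = WaveletMatrix(seq, sigma=10)
recovered = [wm.access(i) for i in range(len(seq))]
ones_before_4 = wm.rank(1, 4)
```

Each level stores a single bitvector of length $n$ plus one integer (the number of zeros), so no tree pointers or per-node boundaries are needed, which is what makes the layout attractive when $\sigma$ is large. Each operation in this sketch performs one or two binary rank computations per level.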
