The wavelet matrix: An efficient wavelet tree for large alphabets

The wavelet tree is a flexible data structure that represents a sequence S[1, n] of symbols over an alphabet of size σ within compressed space, while supporting a wide range of operations on S. When σ is significant compared to n, current wavelet tree representations incur noticeable space or time overheads. In this article we introduce the wavelet matrix, an alternative representation for large alphabets that retains all the properties of wavelet trees but is significantly faster. We also show how the wavelet matrix can be compressed down to the zero-order entropy of the sequence without sacrificing, and in fact while improving, its time performance. Our experimental results show that the wavelet matrix outperforms all the wavelet tree variants along the space/time tradeoff map.

Highlights

- We improve current wavelet tree representations on large alphabets.
- We reduce the number of operations needed to solve access, rank and select queries.
- We introduce Huffman compression on the sequence to further reduce space and time.
- We show that the resulting structures are the most efficient to represent sequences on large alphabets in most aspects.
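To make the access and rank operations mentioned above concrete, the following is a minimal, uncompressed sketch of a wavelet matrix in Python. It is an illustration under stated assumptions, not the article's implementation: the class name `WaveletMatrix`, its method names, the 0-based indexing, and the use of plain prefix-count arrays instead of succinct rank/select bitmaps are all choices made here for readability. Select and the Huffman-shaped variant are omitted.

```python
class WaveletMatrix:
    """Illustrative wavelet matrix over integers in [0, sigma); not space-efficient."""

    def __init__(self, seq, sigma):
        self.n = len(seq)
        # One level per bit of the largest symbol (at least one level).
        self.levels = max(1, (sigma - 1).bit_length())
        self.bits = []    # bits[l][i]: bit of level l at position i
        self.ones = []    # ones[l][i]: number of 1s in bits[l][0:i]
        self.zeros = []   # zeros[l]: total number of 0s at level l
        cur = list(seq)
        for l in range(self.levels):
            shift = self.levels - 1 - l  # most significant bit first
            level_bits = [(x >> shift) & 1 for x in cur]
            prefix = [0]
            for b in level_bits:
                prefix.append(prefix[-1] + b)
            self.bits.append(level_bits)
            self.ones.append(prefix)
            self.zeros.append(self.n - prefix[-1])
            # Stable partition: symbols with bit 0 go first, then those with bit 1.
            cur = ([x for x in cur if not (x >> shift) & 1]
                   + [x for x in cur if (x >> shift) & 1])

    def _rank1(self, l, i):
        return self.ones[l][i]

    def _rank0(self, l, i):
        return i - self.ones[l][i]

    def access(self, i):
        """Return the symbol at 0-based position i."""
        c = 0
        for l in range(self.levels):
            b = self.bits[l][i]
            c = (c << 1) | b
            if b:
                i = self.zeros[l] + self._rank1(l, i)
            else:
                i = self._rank0(l, i)
        return c

    def rank(self, c, i):
        """Return the number of occurrences of symbol c in seq[0:i]."""
        p = 0  # start of the range of symbols sharing c's prefix so far
        for l in range(self.levels):
            b = (c >> (self.levels - 1 - l)) & 1
            if b:
                p = self.zeros[l] + self._rank1(l, p)
                i = self.zeros[l] + self._rank1(l, i)
            else:
                p = self._rank0(l, p)
                i = self._rank0(l, i)
        return i - p


seq = [4, 7, 6, 5, 3, 2, 1, 0, 1, 4, 1, 7]
wm = WaveletMatrix(seq, sigma=8)
recovered = [wm.access(i) for i in range(len(seq))]
```

Each query touches one bitmap per level, so both operations perform a number of binary rank computations proportional to the number of bits per symbol. In this sketch every level stores a single number of zeros rather than per-node boundaries, which is the bookkeeping that would otherwise grow with the alphabet size.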
