Top Tree Compression of Tries

We present a compressed representation of tries based on top tree compression [ICALP 2013] that works on a standard, comparison-based pointer machine model of computation and supports efficient prefix search queries. Namely, we show how to preprocess a set of strings of total length n over an alphabet of size $$\sigma$$ into a compressed data structure of worst-case optimal size $$O(n/\log_\sigma n)$$ that, given a pattern string P of length m, determines whether P is a prefix of one of the strings in time $$O(\min(m\log \sigma, m + \log n))$$. We show that this query time is in fact optimal regardless of the size of the data structure. Existing solutions either use $$\Omega(n)$$ space or rely on word RAM techniques, such as tabulation, hashing, address arithmetic, or word-level parallelism, and hence do not work on a pointer machine. Our result is the first solution on a pointer machine that achieves worst-case o(n) space. Along the way, we develop several data structures that work on a pointer machine and are of independent interest. These include an optimal data structure for random access to a grammar-compressed string and an optimal data structure for a variant of the level ancestor problem.
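To make the prefix search query concrete, here is a minimal sketch of an ordinary, uncompressed trie that answers the same question using only character comparisons. It is a baseline for illustration, not the compressed structure described above: it uses space linear in n, and the names `TrieNode`, `insert`, `is_prefix`, and `build_trie` are hypothetical. Children are kept in a sorted Python list and located by binary search, which gives $$O(m\log \sigma)$$ comparisons per query; on a true pointer machine the sorted list would be replaced by a balanced search tree, since indexing into a list relies on address arithmetic.

```python
from bisect import bisect_left


class TrieNode:
    """One trie node; edges are stored as two parallel, sorted lists."""

    def __init__(self):
        self.labels = []    # edge labels (single characters), kept sorted
        self.children = []  # children[i] is reached via labels[i]


def insert(root, s):
    """Insert string s into the trie rooted at root."""
    node = root
    for ch in s:
        i = bisect_left(node.labels, ch)
        if i == len(node.labels) or node.labels[i] != ch:
            # No edge labelled ch yet: create it at its sorted position.
            node.labels.insert(i, ch)
            node.children.insert(i, TrieNode())
        node = node.children[i]


def is_prefix(root, pattern):
    """Return True if pattern is a prefix of some inserted string."""
    node = root
    for ch in pattern:
        # Binary search over at most sigma outgoing edges.
        i = bisect_left(node.labels, ch)
        if i == len(node.labels) or node.labels[i] != ch:
            return False
        node = node.children[i]
    return True


def build_trie(strings):
    """Build a trie over an iterable of strings."""
    root = TrieNode()
    for s in strings:
        insert(root, s)
    return root
```

For example, after `root = build_trie(["banana", "band", "bat"])`, the call `is_prefix(root, "ban")` succeeds while `is_prefix(root, "bad")` fails at the third character.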
