Practical compressed string dictionaries

The need to store and query a set of strings - a string dictionary - arises in many kinds of applications. While classically these string dictionaries have accounted for a small share of the total space budget (e.g., in Natural Language Processing or when indexing text collections), recent applications in Web engines, Semantic Web (RDF) graphs, Bioinformatics, and many others handle very large string dictionaries, whose size is a significant fraction of the whole data. In these cases, string dictionary management is a scalability issue by itself. This paper focuses on the problem of managing large static string dictionaries in compressed main memory space. We revisit classical solutions for string dictionaries like hashing, tries, and front-coding, and improve them by using compression techniques. We also introduce some novel string dictionary representations built on top of recent advances in succinct data structures and full-text indexes. All these structures are empirically compared on a heterogeneous testbed formed by real-world string dictionaries. We show that the compressed representations may use as little as 5% of the original dictionary size, while supporting lookup operations within a few microseconds. These numbers outperform the state-of-the-art space/time tradeoffs in many cases. Furthermore, we enhance some representations to provide prefix- and substring-based searches, which also perform competitively. The results show that compressed string dictionaries are a useful building block for various data-intensive applications in different domains.

Highlights: We address the problem of managing string dictionaries in compressed space. We combine data structures and compression to propose several competitive solutions. Our approaches usually outperform the state-of-the-art techniques on real-world dictionaries. All our techniques are implemented and released in a C++ library hosted at GitHub.
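As an illustration of one of the classical techniques named above, the following is a minimal sketch of bucketed front-coding, written in Python for brevity. It is not the paper's implementation or the API of its C++ library: the function names, the tuple layout, and the default bucket size are assumptions made for this example, and the sketch stores suffixes as plain strings rather than applying any further compression. Each bucket keeps its first string in full; every following string is stored as the length of the prefix it shares with its predecessor plus the remaining suffix.

```python
from bisect import bisect_right

# Hypothetical default; real implementations tune this space/time knob.
BUCKET_SIZE = 4


def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings, bucket_size=BUCKET_SIZE):
    """Encode a lexicographically sorted list of distinct strings.

    Returns a list of buckets (header, tail): header is the first string
    of the bucket in full, tail is a list of (shared_prefix_len, suffix)
    pairs, each relative to the previous string in the bucket.
    """
    buckets = []
    for start in range(0, len(sorted_strings), bucket_size):
        chunk = sorted_strings[start:start + bucket_size]
        prev = chunk[0]
        tail = []
        for s in chunk[1:]:
            k = common_prefix_len(prev, s)
            tail.append((k, s[k:]))
            prev = s
        buckets.append((chunk[0], tail))
    return buckets


def extract(buckets, i, bucket_size=BUCKET_SIZE):
    """Return the i-th string (0-based) by decoding within its bucket."""
    header, tail = buckets[i // bucket_size]
    s = header
    for k, suffix in tail[: i % bucket_size]:
        s = s[:k] + suffix
    return s


def locate(buckets, query, bucket_size=BUCKET_SIZE):
    """Return the 0-based position of query, or -1 if it is absent."""
    headers = [h for h, _ in buckets]
    b = bisect_right(headers, query) - 1  # last bucket with header <= query
    if b < 0:
        return -1
    header, tail = buckets[b]
    s = header
    if s == query:
        return b * bucket_size
    for offset, (k, suffix) in enumerate(tail, start=1):
        s = s[:k] + suffix
        if s == query:
            return b * bucket_size + offset
    return -1
```

In this sketch the bucket size is the space/time knob: larger buckets store fewer full strings but require longer sequential decoding per lookup. The same `bucket_size` must be passed to `front_encode`, `extract`, and `locate`.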
