Space-Efficient Data Structures for Information Retrieval

The amount of data that people and companies store has grown exponentially over the last few years. Storing this information is not enough on its own: to make it useful, we need to be able to search inside it efficiently. Furthermore, it is highly valuable to keep the historic data of each document, so that we can access and search not only the newest version but the whole history of the documents. Grammar-based compression has proven to be very effective for repetitive data, which is the case for versioned documents. In this thesis we present several results on representing textual information and searching in it. In particular, we present text indexes for grammar-based compressed text that support searching for a pattern and extracting substrings of the input text. These are the first general indexes for grammar-based compressed text that support searching in sublinear time. In order to build our indexes, we present new results on representing binary relations in a space-efficient manner, as well as construction algorithms that use little space. These two results have a wide range of applications. In particular, the representations for binary relations can be used as a building block for several structures in computer science, such as graphs and inverted indexes. Finally, we present a new index, based on grammar-based compression, that solves the document listing problem. This problem deals with representing a collection of texts and finding the documents that contain a given pattern. Although it is similar to the classical text indexing problem, it has proven to be a challenge when we want to pay time proportional to the size of the result (the number of matching documents) rather than to the number of occurrences of the pattern. Our proposal is designed particularly for versioned text, allowing the storage of a collection of documents with all their historic versions in little space. This is currently the smallest structure for such a purpose in practice.
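To make the idea of grammar-based compression concrete, the following is a minimal illustrative sketch, not the construction used in the thesis. It assumes a simple Re-Pair-style heuristic: repeatedly replace the most frequent adjacent pair of symbols with a new nonterminal. The function names `repair_compress`, `expand`, and `decompress` are hypothetical, and pair counting is deliberately naive (overlapping pairs are counted, and no indexing is built on top of the grammar).

```python
from collections import Counter


def repair_compress(text):
    """Build a small grammar for `text` (illustrative Re-Pair-style heuristic).

    Returns (seq, rules): `seq` is the final sequence of symbols and
    `rules` maps each nonterminal to the pair of symbols it stands for.
    Terminals are single characters; nonterminals are ("R", k) tuples.
    """
    seq = list(text)
    rules = {}
    next_id = 0
    while True:
        # Count adjacent pairs (naively, overlaps included).
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so a new rule would not help
        nonterminal = ("R", next_id)
        next_id += 1
        rules[nonterminal] = pair
        # Replace non-overlapping occurrences of the pair, left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand one symbol back into the text it derives."""
    if symbol not in rules:
        return symbol  # terminal character
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def decompress(seq, rules):
    """Recover the original text from the grammar."""
    return "".join(expand(s, rules) for s in seq)


if __name__ == "__main__":
    # A highly repetitive input, standing in for versioned documents.
    text = "the quick brown fox. " * 8
    seq, rules = repair_compress(text)
    assert decompress(seq, rules) == text
    print(len(text), "characters ->", len(seq) + 2 * len(rules), "grammar symbols")
```

On repetitive input such as many near-identical versions of a document, the grammar (final sequence plus two symbols per rule) tends to be much smaller than the text, which is the property the indexes described above rely on.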
