Document retrieval on repetitive string collections

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problems of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
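To make the query definitions above concrete, the following is a minimal brute-force sketch of their semantics. It scans every document on each query, so it is only a reference for what the queries return and not the compressed indexes developed in the paper. All function names are hypothetical, overlapping occurrences are counted, and the score tf × log(N/df) is assumed here as one common tf-idf variant; the paper's exact weighting may differ.

```python
import math
from collections import Counter

def count_occurrences(doc, pattern):
    """Count (possibly overlapping) occurrences of pattern in doc."""
    count, start = 0, doc.find(pattern)
    while start != -1:
        count += 1
        start = doc.find(pattern, start + 1)
    return count

def term_frequencies(docs, pattern):
    """Map document id -> occurrence count, only for documents containing pattern."""
    tf = {}
    for doc_id, doc in enumerate(docs):
        c = count_occurrences(doc, pattern)
        if c > 0:
            tf[doc_id] = c
    return tf

def document_listing(docs, pattern):
    """All documents where pattern appears."""
    return sorted(term_frequencies(docs, pattern))

def document_counting(docs, pattern):
    """Number of documents where pattern appears."""
    return len(term_frequencies(docs, pattern))

def top_k(docs, pattern, k):
    """The k documents where pattern appears most often (ties broken by id)."""
    tf = term_frequencies(docs, pattern)
    return sorted(tf, key=lambda d: (-tf[d], d))[:k]

def tf_idf_query(docs, patterns, k, conjunctive):
    """Ranked multi-term query; score = sum over terms of tf * log(N / df) (assumed variant)."""
    n = len(docs)
    scores, matched = Counter(), Counter()
    for p in patterns:
        tf = term_frequencies(docs, p)
        if not tf:
            continue
        idf = math.log(n / len(tf))
        for doc_id, freq in tf.items():
            scores[doc_id] += freq * idf
            matched[doc_id] += 1
    if conjunctive:
        # Conjunctive: keep only documents that contain every query term.
        candidates = [d for d in matched if matched[d] == len(patterns)]
    else:
        # Disjunctive: keep documents that contain at least one query term.
        candidates = list(matched)
    return sorted(candidates, key=lambda d: (-scores[d], d))[:k]

docs = ["abracadabra", "abracadabro", "cadabra", "xyz"]
print(document_listing(docs, "abra"))   # [0, 1, 2]
print(document_counting(docs, "abra"))  # 3
print(top_k(docs, "abra", 2))           # [0, 1]
```

On a repetitive collection such as the near-identical strings in `docs`, most documents yield similar answers for the same pattern, which is the redundancy that a compressed index can exploit instead of rescanning the text.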
