Compression, Indexing, and Retrieval for Massive String Data

The field of compressed data structures seeks to support fast search over a compressed representation, ideally one that requires less space than the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss breakthroughs in compressed data structures over the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, yet provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
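To make the contrast with inverted indexes concrete, the following is a minimal sketch and not the structures discussed in the presentation. It uses a plain, uncompressed suffix array built naively, where the compressed indexes referred to above would store an equivalent structure in far less space. The function names and the sample text are assumptions chosen for illustration. The sketch shows only the difference in search capability: a word-level inverted index answers whole-word queries, while a suffix-based text index finds any substring.

```python
from collections import defaultdict


def build_suffix_array(text):
    # Naive construction by sorting all suffixes; acceptable only for tiny inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    # Binary search for the contiguous range of suffixes that start with pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def build_inverted_index(text):
    # Word-level inverted index: maps each space-separated word to its start positions.
    index = defaultdict(list)
    pos = 0
    for word in text.split(" "):
        index[word].append(pos)
        pos += len(word) + 1
    return dict(index)


text = "compressed text indexes compress text"
sa = build_suffix_array(text)
inv = build_inverted_index(text)

print(find_occurrences(text, sa, "press"))  # substring query on the suffix array
print(inv.get("press", []))                 # the inverted index has no such word
print(inv.get("text", []))                  # whole-word query works on both
```

In this sketch the query "press" is found inside "compressed" and "compress" by the suffix array but is invisible to the inverted index, which only knows whole words.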
