New algorithms on wavelet trees and applications to information retrieval

Wavelet trees are widely used to represent sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that these uses still fall short of exploiting all the virtues of this versatile data structure. In particular, we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries. We explore several applications of these queries in Information Retrieval, in particular document retrieval in hierarchical and temporal documents, and the representation of inverted lists.
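The three queries named above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a pointer-based wavelet tree over integers in which each node stores a plain prefix-count array in place of a succinct bitmap with rank support, and the names `WaveletTree`, `quantile`, `next_value`, and `intersect` are hypothetical. Ranges are 0-based and half-open, so `(i, j)` denotes `seq[i:j]`.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integers in [lo, hi] (illustrative only)."""

    def __init__(self, seq, lo=None, hi=None):
        seq = list(seq)
        # The top-level call must receive a non-empty sequence.
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.left = self.right = None
        # zeros[p] = number of the first p symbols that go to the left child.
        self.zeros = [0]
        if self.lo == self.hi or not seq:
            return
        mid = (self.lo + self.hi) // 2
        for v in seq:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletTree([v for v in seq if v <= mid], self.lo, mid)
        self.right = WaveletTree([v for v in seq if v > mid], mid + 1, self.hi)

    def quantile(self, i, j, k):
        """Range quantile: k-th smallest value (1-based) in seq[i:j]."""
        if not 1 <= k <= j - i:
            raise ValueError("k must satisfy 1 <= k <= j - i")
        node = self
        while node.lo != node.hi:
            zi, zj = node.zeros[i], node.zeros[j]
            if k <= zj - zi:            # answer lies among the smaller values
                i, j, node = zi, zj, node.left
            else:                       # skip the smaller values, go right
                k -= zj - zi
                i, j, node = i - zi, j - zj, node.right
        return node.lo

    def next_value(self, i, j, x):
        """Range next value: smallest value >= x in seq[i:j], or None."""
        if i >= j or x > self.hi:
            return None
        if self.lo == self.hi:
            return self.lo
        zi, zj = self.zeros[i], self.zeros[j]
        if x <= (self.lo + self.hi) // 2:
            found = self.left.next_value(zi, zj, x)
            if found is not None:
                return found
        return self.right.next_value(i - zi, j - zj, x)

    def intersect(self, i1, j1, i2, j2):
        """Range intersection: sorted distinct values in both seq[i1:j1] and seq[i2:j2]."""
        if i1 >= j1 or i2 >= j2:        # prune as soon as either range is empty
            return []
        if self.lo == self.hi:
            return [self.lo]
        a1, b1 = self.zeros[i1], self.zeros[j1]
        a2, b2 = self.zeros[i2], self.zeros[j2]
        return (self.left.intersect(a1, b1, a2, b2)
                + self.right.intersect(i1 - a1, j1 - b1, i2 - a2, j2 - b2))


if __name__ == "__main__":
    seq = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
    wt = WaveletTree(seq)
    print(wt.quantile(2, 8, 3))        # 3rd smallest of seq[2:8]
    print(wt.next_value(2, 8, 7))      # smallest value >= 7 in seq[2:8]
    print(wt.intersect(0, 4, 6, 10))   # values common to seq[0:4] and seq[6:10]
```

Each query descends the tree once per alphabet level and maps the range endpoints into the child with the prefix counts. A succinct implementation would replace the `zeros` arrays with bitmaps supporting constant-time rank, keeping the same traversal logic.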
