Compact Set Representation for Information Retrieval

Conjunctive Boolean queries are a fundamental operation in web search engines. These queries can be reduced to the problem of intersecting ordered sets of integers, where each set represents the documents containing one of the query terms. But there is tension between the desire to store the lists compactly, in a compressed form, and the desire to carry out intersection operations efficiently, using non-sequential processing modes. In this paper we evaluate intersection algorithms on compressed sets, comparing them to the best non-sequential array-based intersection algorithms. By adding a simple, low-cost auxiliary index, we show that compressed storage need not hinder fast intersection operations.
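To make the reduction concrete, the following is a minimal sketch of a conjunctive query evaluated as an intersection of sorted document-identifier lists, using a galloping (exponential) search as one example of a non-sequential access pattern. It operates on plain uncompressed arrays and is not the method evaluated in the paper; the function names `gallop`, `intersect_pair`, and `conjunctive_query` and the sample postings are assumptions made for illustration only.

```python
from bisect import bisect_left


def gallop(lst, target, lo):
    """Return the first index i >= lo with lst[i] >= target (len(lst) if none)."""
    step = 1
    hi = lo
    # Double the probe distance until we reach or pass the target, or run off the end.
    while hi < len(lst) and lst[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # Finish with a binary search inside the bracketed range.
    return bisect_left(lst, target, lo, min(hi, len(lst)))


def intersect_pair(small, large):
    """Intersect two sorted lists by searching `large` for each element of `small`."""
    result = []
    pos = 0
    for doc_id in small:
        pos = gallop(large, doc_id, pos)
        if pos == len(large):
            break  # nothing left in `large` can match
        if large[pos] == doc_id:
            result.append(doc_id)
    return result


def conjunctive_query(postings):
    """AND together several sorted postings lists, shortest list first."""
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    result = list(ordered[0])
    for lst in ordered[1:]:
        result = intersect_pair(result, lst)
    return result


# Hypothetical postings lists for three query terms.
postings = [
    [1, 4, 7, 9, 12, 20],
    [4, 9, 10, 20, 33],
    [2, 4, 20, 21],
]
print(conjunctive_query(postings))  # [4, 20]
```

Processing the shortest list first keeps the candidate set small, and the galloping search skips over runs of the longer list rather than scanning them. That skipping is what a purely sequential compressed representation makes difficult, which is the tension the abstract describes.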
