Document Reordering for Faster Intersection

A large body of research has studied how to optimize inverted index structures in search engines through suitable reassignment of document identifiers. This approach was originally proposed to allow for better compression of the index, but subsequent work showed that it can also result in significant speed-ups for conjunctive queries and even for certain types of disjunctive top-k algorithms. However, we do not have a good understanding of why this happens, or of how we could directly optimize an index for query processing speed. As a result, existing techniques attempt to optimize for size and treat speed increases as a welcome side effect. In this paper, we take an initial but important step towards understanding and modeling speed increases due to document reordering. We define the problem of minimizing the cost of queries given an inverted index and a query distribution, relate it to work on adaptive set intersection, and show that it is fundamentally different from the problem of minimizing compressed index size. We then propose a heuristic algorithm for finding a document reordering that minimizes query processing costs under suitable cost models. Our experiments show significant increases in the speed of intersections over state-of-the-art reordering techniques.

PVLDB Reference Format: Qi Wang, Torsten Suel. Document Reordering for Faster Intersection. PVLDB, 12(5): 475-487, 2019. DOI: https://doi.org/10.14778/3303753.3303755
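To make the link between document identifier assignment and intersection cost concrete, the following is a minimal Python sketch, not the algorithm or cost model from the paper. It assumes a toy cost model that counts the number of forward seeks in a leapfrog-style intersection of two sorted posting lists, and compares that count before and after a hand-picked reordering. The names `intersect_count` and `remap`, the example postings, and the permutation are all hypothetical.

```python
from bisect import bisect_left

def intersect_count(a, b):
    """Leapfrog intersection of two sorted docID lists.

    Returns (matches, seeks), where seeks counts forward searches.
    The seek count is an assumed toy cost, not the paper's cost model.
    """
    i = j = 0
    seeks = 0
    matches = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matches.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Advance in a to the first docID >= b[j].
            i = bisect_left(a, b[j], i + 1)
            seeks += 1
        else:
            # Advance in b to the first docID >= a[i].
            j = bisect_left(b, a[i], j + 1)
            seeks += 1
    return matches, seeks

def remap(postings, perm):
    """Apply a docID permutation and restore sorted order."""
    return sorted(perm[d] for d in postings)

# Hypothetical postings: the two terms interleave and share only doc 8.
term_a = [0, 2, 4, 6, 8]
term_b = [1, 3, 5, 7, 8]

# Hypothetical reordering that groups each term's documents together.
perm = {0: 0, 2: 1, 4: 2, 6: 3, 1: 4, 3: 5, 5: 6, 7: 7, 8: 8}

before = intersect_count(term_a, term_b)
after = intersect_count(remap(term_a, perm), remap(term_b, perm))
print("seeks before reordering:", before[1])
print("seeks after reordering:", after[1])
```

In this toy instance the result set is the same under both orderings, but the clustered ordering needs fewer seeks because the two lists alternate less often. A real optimizer would have to weigh such effects across a whole query distribution, which is the problem the abstract describes.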
