The Potential of Learned Index Structures for Index Compression

Inverted indexes are vital for fast keyword-based search. For every term in the document collection, a list of identifiers of the documents in which the term appears is stored, along with auxiliary information such as term frequencies and position offsets. While very effective, inverted indexes have large memory requirements for web-sized collections. Recently, the concept of learned index structures was introduced, where machine-learned models replace common index structures such as B-tree indexes, hash indexes, and Bloom filters. These learned index structures require less memory and can be computationally much faster than their traditional counterparts. In this paper, we consider whether such models can be applied to conjunctive Boolean querying. First, we investigate how a learned model can replace the document postings of an inverted index, and evaluate the trade-offs such an approach entails. Second, we evaluate the potential gains that can be achieved in terms of memory requirements. Our work shows that learned models have great potential in inverted indexing, and this direction seems to be a promising area for future research.
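To make the core idea concrete, the following is a minimal sketch (not the paper's actual method) of how a learned model can stand in for a postings list: a simple linear model, fitted to the sorted document identifiers, predicts where a given identifier would sit in the list, and the model's recorded maximum error bounds a small window in which a conventional search finishes the lookup. The class name `LearnedPostings` and all details below are illustrative assumptions.

```python
import bisect
import math

class LearnedPostings:
    """Illustrative sketch of a learned index over a sorted postings list.

    A linear model (least-squares fit of document ID -> position) predicts
    the position of a document ID; the maximum observed prediction error
    bounds the window that a binary search must examine.
    """

    def __init__(self, postings):
        self.postings = postings  # sorted list of document IDs
        n = len(postings)
        xs, ys = postings, list(range(n))
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var = sum((x - mean_x) ** 2 for x in xs)
        self.slope = cov / var if var else 0.0
        self.intercept = mean_y - self.slope * mean_x
        # Maximum prediction error defines the local search window.
        self.err = math.ceil(max(abs(self._predict(x) - y)
                                 for x, y in zip(xs, ys)))

    def _predict(self, docid):
        return self.slope * docid + self.intercept

    def contains(self, docid):
        """Membership test: predict a position, then binary-search only
        within the error-bounded window around that prediction."""
        pos = round(self._predict(docid))
        lo = max(0, pos - self.err)
        hi = min(len(self.postings), pos + self.err + 1)
        i = bisect.bisect_left(self.postings, docid, lo, hi)
        return i < len(self.postings) and self.postings[i] == docid
```

Conjunctive Boolean queries would then intersect terms by probing each term's model with candidate document IDs, trading the space of an explicit compressed postings list for a few model parameters plus an error bound.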
