A Hybrid BitFunnel and Partitioned Elias-Fano Inverted Index

Search engines face a time vs. space trade-off: search responsiveness (i.e., a short query response time) comes at the cost of increased index storage. We propose a hybrid method that combines (a) the recently published mapping-matrix-style index BitFunnel (BF), used for search efficiency, with (b) the state-of-the-art Partitioned Elias-Fano (PEF) inverted-index compression method. We apply the hybrid method in two settings: minimizing time under a fixed space constraint, and minimizing space under a fixed time constraint. Each document is stored using either BF or PEF, and a local search strategy finds an approximately optimal BF-PEF partition. Because running full experiments on every candidate BF-PEF partition is impractically slow, we use a regression model to predict the time and space costs of candidate partitions (space accuracy 97.6%; time accuracy 95.2%). Compared with a hybrid mathematical index (Ottaviano et al., 2015), the time cost is reduced by up to 47% without significantly exceeding that index's size. Compared with three mathematical encoding methods, the hybrid BF-PEF index performs list intersection roughly 16% to 76% faster without significantly increasing the index size. Compared with BF, the index size is reduced by 45% while the intersection time remains comparable to that of BF.
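The partition search summarized above can be illustrated with a minimal sketch of the time-minimizing setting. This is not the authors' implementation: the names `local_search` and `make_toy_predictor`, the single-flip neighborhood, the all-PEF starting point, and the additive toy cost model (standing in for the regression model) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: True = item stored in BF, False = stored in PEF.
Assignment = List[bool]
# Hypothetical predictor interface: returns (predicted_time, predicted_space).
Predictor = Callable[[Assignment], Tuple[float, float]]


def make_toy_predictor(costs: List[Tuple[float, float, float, float]]) -> Predictor:
    """Additive stand-in for a learned cost model (an assumption).

    Each tuple is (bf_time, pef_time, bf_space, pef_space) for one item.
    """
    def predict(assign: Assignment) -> Tuple[float, float]:
        time = sum(c[0] if in_bf else c[1] for c, in_bf in zip(costs, assign))
        space = sum(c[2] if in_bf else c[3] for c, in_bf in zip(costs, assign))
        return time, space
    return predict


def local_search(n_items: int, predict: Predictor, space_budget: float,
                 max_rounds: int = 100) -> Assignment:
    """Hill-climb over single-item flips, minimizing predicted time
    while keeping predicted space within space_budget."""
    # Start from all-PEF, assumed here to be the smallest (feasible) index.
    current = [False] * n_items
    cur_time, _ = predict(current)
    for _ in range(max_rounds):
        best_flip, best_time = None, cur_time
        for i in range(n_items):
            current[i] = not current[i]      # try flipping item i
            t, s = predict(current)
            current[i] = not current[i]      # undo the trial flip
            if s <= space_budget and t < best_time:
                best_flip, best_time = i, t
        if best_flip is None:                # no feasible improving flip: local optimum
            break
        current[best_flip] = not current[best_flip]
        cur_time = best_time
    return current


# Toy usage with made-up costs for three items.
toy_costs = [(1, 5, 10, 2), (2, 3, 10, 2), (1, 9, 10, 2)]
toy_predict = make_toy_predictor(toy_costs)
partition = local_search(len(toy_costs), toy_predict, space_budget=22)
```

Swapping the roles of time and space in the feasibility check and the objective gives the space-minimizing setting under a time constraint.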

[1] Clifford Stein, et al. Introduction to Algorithms, 3rd ed., 2012.

[2] Giuseppe Ottaviano, et al. Partitioned Elias-Fano indexes, 2014, SIGIR.

[3] Alistair Moffat, et al. Index compression using 64‐bit words, 2010, Softw. Pract. Exp.

[4] Alistair Moffat, et al. Inverted Index Compression Using Word-Aligned Binary Codes, 2005.

[5] Marcin Zukowski, et al. Super-Scalar RAM-CPU Cache Compression, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[6] Giuseppe Ottaviano, et al. Optimal Space-time Tradeoffs for Inverted Indexes, 2015, WSDM.

[7] Torsten Suel, et al. Inverted index compression and query processing with optimized document ordering, 2009, WWW '09.

[8] Alistair Moffat, et al. Binary Interpolative Coding for Effective Index Compression, 2000, Information Retrieval.

[9] Berkant Barla Cambazoglu, et al. Scalability Challenges in Web Search Engines, 2015, Advanced Topics in Information Retrieval.

[10] J. Friedman. Greedy function approximation: A gradient boosting machine, 2001.

[11] Rakesh Agrawal, et al. A Study of Distinctiveness in Web Results of Two Search Engines, 2015, WWW.

[12] Alexander J. Smola, et al. Support Vector Regression Machines, 1996, NIPS.

[13] Prabhakant Sinha, et al. The Multiple-Choice Knapsack Problem, 1979, Oper. Res.

[14] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[15] Henry Tan, et al. Maguro, a system for indexing and searching over very large text collections, 2013, WSDM.

[16] Yusheng Ji, et al. Improved Weighted Bloom Filter and Space Lower Bound Analysis of Algorithms for Approximated Membership Querying, 2015, DASFAA.

[17] J. Shane Culpepper, et al. Efficient set intersection for inverted indexing, 2010, TOIS.

[18] Frank Wm. Tompa, et al. Skewed partial bitvectors for list intersection, 2014, SIGIR.

[19] Charles L. A. Clarke, et al. Information Retrieval - Implementing and Evaluating Search Engines, 2010.

[20] Burton H. Bloom, et al. Space/time trade-offs in hash coding with allowable errors, 1970, CACM.

[21] Sameh Elnikety, et al. BitFunnel: Revisiting Signatures for Search, 2017, SIGIR.

[22] Peter Elias, et al. Efficient Storage and Retrieval by Content and Address of Static Files, 1974, JACM.

[23] Gang Wang, et al. Index Compression for BitFunnel Query Processing, 2018, SIGIR.

[24] Zhiyong Peng, et al. An efficient random access inverted index for information retrieval, 2010, WWW '10.

[25] Alexander A. Stepanov, et al. SIMD-based decoding of posting lists, 2011, CIKM '11.

[26] Fabrizio Silvestri, et al. Mining query logs to optimize index partitioning in parallel web search engines, 2007.

[27] Owen Kaser, et al. Sorting improves word-aligned bitmap indexes, 2010, Data Knowl. Eng.