Index Compression for BitFunnel Query Processing

Large-scale search engines utilize inverted indexes which store ordered lists of document identifies (docIDs) relevant to query terms, which can be queried thousands of times per second. In order to reduce storage requirements, we propose a dictionary-based compression approach for the recently proposed bitwise data-structure BitFunnel, which makes use of a Bloom filter. Compression is achieved through storing frequently occurring blocks in a dictionary. Infrequently occurring blocks (those which are not represented in the dictionary) are instead referenced using similar blocks that are in the dictionary, introducing additional false positive errors. We further introduce a docID reordering strategy to improve compression. Experimental results indicate an improvement in compression by 27% to 30%, at the expense of increasing the query processing time by 16% to 48% and increasing the false positive rate by around 7.6 to 10.7 percentage points.

[1]  Sameh Elnikety,et al.  BitFunnel: Revisiting Signatures for Search , 2017, SIGIR.

[2]  Giuseppe Ottaviano,et al.  Partitioned Elias-Fano indexes , 2014, SIGIR.

[3]  Hugh E. Williams,et al.  Compressing Integers for Fast File Access , 1999, Comput. J..

[4]  Yusheng Ji,et al.  Improved Weighted Bloom Filter and Space Lower Bound Analysis of Algorithms for Approximated Membership Querying , 2015, DASFAA.

[5]  Marcin Zukowski,et al.  Super-Scalar RAM-CPU Cache Compression , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[6]  Frank Wm. Tompa,et al.  Skewed partial bitvectors for list intersection , 2014, SIGIR.

[7]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[8]  Owen Kaser,et al.  Sorting improves word-aligned bitmap indexes , 2010, Data Knowl. Eng..

[9]  Onur Mutlu,et al.  Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).