Building Fast and Compact Sketches for Approximately Multi-Set Multi-Membership Querying

Given a set $S$, Membership Querying (MQ) answers whether a query element $q \in S$. It is a fundamental task in areas such as database systems and computer networks. In this paper, we consider a more general problem, Multi-Set Multi-Membership Querying (MS-MMQ). Given $n$ sets $S_0, \ldots, S_{n-1}$, MS-MMQ answers which of the sets contain an element $q$. A direct way to address MS-MMQ is to build an MQ structure (e.g., a Bloom filter) for each set. However, the query and space complexities of this approach grow linearly with $n$ and become prohibitive for large $n$. To address this challenge, we propose a novel Circular Shift and Coalesce (CSC) framework that efficiently achieves approximate MS-MMQ. Instead of building an MQ data structure for each set, the CSC index encodes all $n$ sets into a single compact sketch and retrieves only a few bytes of the sketch per query, which yields high memory efficiency and speeds up queries several-fold. CSC is compatible with mainstream data structures for approximate MQ. We conduct experiments on real-world datasets, and the results show that our framework is up to 91.2 times faster and up to 48.9 times more accurate than state-of-the-art methods.
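To make the linear cost of the direct approach concrete, the following is a minimal sketch of the per-set baseline only, not of CSC itself. It assumes a simple Bloom filter built on SHA-256; the names `BloomFilter`, `build_baseline`, and `query_baseline`, as well as the sizes 1024 bits and 3 hash functions, are illustrative choices and do not come from the paper.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical helper, for illustration only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the code simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # No false negatives; false positives are possible.
        return all(self.bits[pos] for pos in self._positions(item))


def build_baseline(sets):
    """Build one filter per set: space grows linearly with n."""
    filters = []
    for s in sets:
        bf = BloomFilter()
        for item in s:
            bf.add(item)
        filters.append(bf)
    return filters


def query_baseline(filters, q):
    """Probe every filter: query time also grows linearly with n."""
    return [i for i, bf in enumerate(filters) if q in bf]


sets = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
filters = build_baseline(sets)
result = query_baseline(filters, "b")
# result always contains 0 and 1; it may also contain false positives.
print(result)
```

Each query touches all $n$ filters, which is the cost that a single shared sketch is meant to avoid.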
