Memory-Compact Membership Lookup for Multiple Data Sets by a Single Bloom Filter

A Bloom filter is a memory-compact data structure that encodes a set of data items and answers set membership queries with no false negatives and a configurable false positive rate. It is a fundamental tool with a wide range of applications across disciplines such as data science, networking, computer architecture, and distributed computing. However, a Bloom filter faces a memory allocation challenge: how much memory should be given to its data structure when the encoded data set is formed dynamically and its size is not known in advance? As more set elements arrive, the data structure becomes more crowded, and the false positive rate of membership queries increases. The problem becomes even more challenging when there are multiple data sets to represent and each is formed independently in a streaming fashion. The traditional way to support membership checking for multiple data sets is to allocate a separate Bloom filter to each data set. This paper instead takes a dramatically different approach: we encode all data sets in a single large filter and yet support membership lookup for all of them, with a false positive rate bound that is independently configurable for each set. We analyze the properties of the filter, in particular the formulas for its feasible region, where the false positive rate requirements are met for all data sets.
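
To make the memory allocation problem concrete, the following is a minimal sketch of a textbook single-set Bloom filter. It is not the multi-set filter proposed in this paper. The class name `BloomFilter`, the helper `expected_fpr`, and the SHA-256 double-hashing scheme are assumptions chosen for illustration. The helper uses the standard approximation (1 - e^(-kn/m))^k for the false positive rate of a filter with m bits and k hash functions after n insertions.

```python
import hashlib
import math


class BloomFilter:
    """Textbook Bloom filter with m bits and k hash functions (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str) -> list[int]:
        # Assumption: derive k bit indices by double hashing two 64-bit
        # halves of one SHA-256 digest. Any k independent hashes would do.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Set the k bits selected by the item.
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str) -> bool:
        # Report membership only if all k bits are set. Inserted items
        # always pass (no false negatives); other items may pass by chance.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def expected_fpr(m: int, k: int, n: int) -> float:
    """Standard approximation of the false positive rate after n insertions."""
    return (1.0 - math.exp(-k * n / m)) ** k


if __name__ == "__main__":
    m, k = 8192, 4
    bf = BloomFilter(m, k)
    for i in range(2000):
        bf.add(f"item-{i}")
    print("item-7" in bf)  # an inserted item is always reported as present
    # With m fixed, the estimated false positive rate grows with n.
    for n in (500, 1000, 2000, 4000):
        print(n, expected_fpr(m, k, n))
```

Because m is fixed when the filter is created, the estimate returned by `expected_fpr` rises as n grows, which is the crowding effect described above for a set whose final size is not known in advance.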
