Bloom filters are probabilistic data structures that have been used successfully for approximate membership problems in many areas of computer science (networking, distributed systems, databases, etc.). As data grows in size and becomes more widely distributed, problems arise in which a large number of Bloom filters are available and all of them must be searched for potential matches. For example, in a federated cloud environment with hundreds of geographically distributed clouds, the semi-autonomous cloud providers need to share information. Each provider could encode its information in Bloom filters and share those filters with a central coordinator. The problem of interest is not only whether a given object is in any of the sets represented by the Bloom filters, but also which of those sets contain it. This problem cannot be solved by simply constructing a Bloom filter on the union of all the sets, because such a filter does not record which set contributed each element. We propose Bloofi, a hierarchical index structure for Bloom filters that speeds up the search process and can be efficiently constructed and maintained. We apply our index structure to the problem of determining the complete data provenance graph in a geographically distributed setting. Our theoretical and experimental results show that Bloofi provides a scalable and efficient solution for searching through a large number of Bloom filters.
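To make the search problem concrete, the following is a minimal Python sketch of one way a hierarchical index over Bloom filters could work. It is an illustration under stated assumptions, not the Bloofi implementation: it assumes every filter uses the same size and hash functions, and that each internal node stores the bitwise OR of its children so that a subtree can be skipped when the node's filter rules the object out. The names (`make_filter`, `build_index`, `search`), the parameters `M` and `K`, the fan-out, and the sample data are all hypothetical.

```python
import hashlib

M = 256  # bits per filter (assumed value for illustration)
K = 3    # hash functions per item (assumed value for illustration)


def bit_positions(item: str) -> list[int]:
    """Derive K bit positions for an item from seeded SHA-256 digests."""
    positions = []
    for seed in range(K):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % M)
    return positions


def make_filter(items) -> int:
    """Build a Bloom filter, stored as a Python int used as a bit vector."""
    bits = 0
    for item in items:
        for pos in bit_positions(item):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, item: str) -> bool:
    """True if every bit for the item is set (false positives are possible)."""
    return all((bits >> pos) & 1 for pos in bit_positions(item))


class Node:
    """Tree node: leaves carry a set id, internal nodes carry children."""

    def __init__(self, bits, children=(), set_id=None):
        self.bits = bits
        self.children = list(children)
        self.set_id = set_id


def build_index(filters: dict, fanout: int = 2) -> Node:
    """Group nodes bottom-up; each parent is the OR of its children.

    Assumes `filters` is non-empty and fanout >= 2.
    """
    level = [Node(bits, set_id=sid) for sid, bits in filters.items()]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            union = 0
            for child in group:
                union |= child.bits
            parents.append(Node(union, group))
        level = parents
    return level[0]


def search(node: Node, item: str) -> list:
    """Return ids of all leaf filters that may contain the item."""
    if not might_contain(node.bits, item):
        return []  # prune: nothing below this node can match
    if node.set_id is not None:
        return [node.set_id]
    matches = []
    for child in node.children:
        matches.extend(search(child, item))
    return matches


# Hypothetical federation: one filter per cloud provider.
filters = {
    "cloud-a": make_filter(["alice", "bob"]),
    "cloud-b": make_filter(["carol"]),
    "cloud-c": make_filter(["alice", "dave"]),
}
root = build_index(filters)
candidates = search(root, "alice")  # includes cloud-a and cloud-c
```

The root filter here is exactly the filter of the union of all sets: it can answer whether "alice" may be present somewhere, but only the descent through the tree identifies which providers to ask. Because Bloom filters admit false positives, `candidates` may contain extra ids, but it never omits a set that really holds the object.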