Optimizing data popularity conscious bloom filters

Bloom filters are compact set representations that support set membership queries with small, one-sided error probabilities. Standard Bloom filters are oblivious to object popularity in sets and membership queries. However, sets and queries in many distributed applications follow known, stable, highly skewed distributions (e.g., Zipf-like). This paper studies the problem of minimizing the false-positive probability of a Bloom filter by adapting the number of hashes used for each data object to its popularity in sets and membership queries. We model the problem as a constrained nonlinear integer program and propose two polynomial-time solutions with bounded approximation ratios: one is a 2-approximation algorithm with O(N^c) running time (c ≥ 6 in practice); the other is a (2+ε)-approximation algorithm with O(N^2/ε) running time, ε > 0. Here N denotes the total number of distinct data objects that appear in sets or queries. We quantitatively evaluate our proposed approach on two distributed applications (cooperative caching and full-text keyword searching) driven by real-life data traces. Compared to standard Bloom filters, our data popularity-conscious Bloom filters reduce the false-positive probability by up to 24 and 27 times for the two applications, respectively. The quantitative evaluation also validates our solution's bounded approximation ratio to the optimal.
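
To make the central idea concrete, the following is a minimal Python sketch of a Bloom filter in which the number of hashes varies per object. It is an illustration only, not the paper's algorithm: the class name `PopularityBloomFilter`, the `hashes_for` callback, and the hand-picked hash counts are all assumptions, and the double-hashing scheme is swapped in as a convenient way to derive several bit positions from one digest. Choosing the hash counts well is exactly the optimization problem the paper studies; here they are simply supplied by the caller.

```python
import hashlib


class PopularityBloomFilter:
    """Bloom filter whose hash count may differ per object (illustrative sketch)."""

    def __init__(self, num_bits, hashes_for):
        self.num_bits = num_bits
        # Hypothetical callback: maps a key to the number of hashes to use.
        # It must return the same value at insert time and at query time.
        self.hashes_for = hashes_for
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive two base hashes from one digest, then combine them
        # (double hashing) to obtain as many positions as this key needs.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        k = self.hashes_for(key)
        return [(h1 + i * h2) % self.num_bits for i in range(k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


# Hand-picked hash counts (an assumption for this example, not computed
# by either approximation algorithm from the paper).
hash_counts = {"often-in-sets": 2, "rare-query": 6}
bf = PopularityBloomFilter(1024, lambda key: hash_counts.get(key, 4))

members = ["often-in-sets", "alpha", "beta"]
for m in members:
    bf.add(m)
```

Every inserted key is still reported as present, so the one-sided error property of a standard Bloom filter is preserved; only the false-positive probability of each object changes with its hash count.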
