Near-optimal approximate membership query over time-decaying windows

There has been a long history of finding a space-efficient data structure to support approximate membership queries, starting with Bloom's work in the 1970s. Given a set A of n items and an additional item x from the same universe U of size m ≫ n, we want to determine whether x ∈ A using limited space. Solutions to the membership query are needed in many network applications, such as cache directories, load balancing, and security. If A is static, there exist optimal algorithms that construct a randomized data structure representing A using only (1 + o(1)) n log(1/δ) bits, which allows a small false positive rate δ but no false negatives. However, existing optimal algorithms are not practical for many Internet applications, e.g., social network services, peer-to-peer systems, and network traffic monitoring. They are too space- and time-expensive when the set A changes frequently, because each change requires all items in order to recompute the optimal data structure, which takes linear time. In this paper, we propose a novel data structure to support the approximate membership query in the time-decaying window model. In this model, items are inserted one by one over a data stream, and we want to determine whether an item is among the most recent w items for any given window size w ≤ n. Our data structure requires only O(n(log(1/δ) + log n)) bits and O(1) running time. We also prove a non-trivial space lower bound of (n - δm) log(n - δm) bits, which guarantees that our data structure is near-optimal. Our data structure has been evaluated using both synthetic and real data sets.
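To make the query model concrete, the following is a minimal Python sketch of windowed approximate membership. It is not the data structure proposed in the paper, whose construction is not given in this abstract. It assumes a simple fingerprint-plus-timestamp scheme: each item is reduced to a fingerprint of about log(n/δ) bits, and the position of its latest occurrence is recorded alongside it, which is one plausible way to arrive at a cost of the order log(1/δ) + log n bits per item. The class name, method names, and the choice of BLAKE2b as the hash are all illustrative assumptions.

```python
import hashlib
import math


class WindowedFingerprintFilter:
    """Illustrative sketch only: fingerprint -> position of latest occurrence."""

    def __init__(self, n, delta):
        self.n = n
        # About log2(n / delta) fingerprint bits keeps the chance that a
        # queried item collides with any of the n stored items near delta
        # (union bound). Capped at 64 because the digest below is 8 bytes.
        self.bits = min(64, max(1, math.ceil(math.log2(n / delta))))
        self.ring = [None] * n   # fingerprints of the last n stream positions
        self.last = {}           # fingerprint -> most recent stream position
        self.count = 0           # number of items inserted so far

    def _fingerprint(self, item):
        digest = hashlib.blake2b(repr(item).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") & ((1 << self.bits) - 1)

    def insert(self, item):
        pos = self.count
        slot = pos % self.n
        old = self.ring[slot]
        # The item at position pos - n leaves the largest window; forget its
        # fingerprint unless a newer occurrence has already overwritten it.
        if old is not None and self.last.get(old) == pos - self.n:
            del self.last[old]
        fp = self._fingerprint(item)
        self.ring[slot] = fp
        self.last[fp] = pos
        self.count += 1

    def query(self, item, w):
        """Report whether item may be among the most recent w items (w <= n)."""
        if not 1 <= w <= self.n:
            raise ValueError("window size must satisfy 1 <= w <= n")
        pos = self.last.get(self._fingerprint(item))
        return pos is not None and pos >= self.count - w


f = WindowedFingerprintFilter(n=5, delta=0.001)
for i in range(10):
    f.insert(i)
in_window = f.query(7, 3)  # 7 is among the last three items (7, 8, 9)
```

In this sketch an item that really is inside the window is always reported, because the stored position for its fingerprint can only be at least as recent as its own. A false positive occurs only when a different item with the same fingerprint lies inside the queried window. Both insertion and query take expected constant time through the hash table, though a Python dictionary uses far more memory than the bit counts discussed above.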
