Near-Optimal Approximate Duplicate-Detection in Data Streams Over Sliding Windows for the Uniform Query Frequency or Membership Likelihood

Approximate duplicate detection (or membership query) in data streams answers the question of whether a query element from a large universe U is present in a small subsequence of a data stream. It is an important query with many Internet applications, such as web crawling and social networks. Existing approximate duplicate-detection methods in the sliding window model are not memory-efficient, because their data structure designs do not incorporate the query frequencies and membership likelihoods of the elements in U, even though this information can be obtained with well-developed techniques. In this paper, assuming that either the query frequency or the membership likelihood is uniform for all elements in U, we adopt a block-wise updating strategy to design a memory-efficient data structure, called the cell Bloom filter (CEBF), and an approximate duplicate-detection algorithm based on CEBF. If the average false positive rate is $\varepsilon$ and the sliding window size is $n$, the number of bits used by our method is $2\log_2(e)\, n \left(\log_2 \frac{1}{\varepsilon} + 1\right)$, which is much less than that of existing algorithms. Experimental results on synthetic data verify the effectiveness of our method.
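To give a feel for the stated memory bound, the following minimal sketch evaluates it numerically. It assumes the bound reads 2 log2(e) * n * (log2(1/eps) + 1) bits, as reconstructed from the abstract. The function names `cebf_bits` and `static_bloom_bits` are hypothetical and not taken from the paper. The second function is the textbook size of an optimally tuned Bloom filter over a fixed set of n elements. It is included only as a familiar reference point, since a static filter cannot expire elements that leave the window.

```python
import math


def cebf_bits(n: int, eps: float) -> float:
    """Bits implied by the abstract's bound: 2*log2(e)*n*(log2(1/eps) + 1).

    Assumption: this is the formula as reconstructed from the abstract,
    not an implementation of the CEBF data structure itself.
    """
    if n <= 0 or not (0.0 < eps < 1.0):
        raise ValueError("need n > 0 and 0 < eps < 1")
    return 2 * math.log2(math.e) * n * (math.log2(1 / eps) + 1)


def static_bloom_bits(n: int, eps: float) -> float:
    """Textbook optimum for a static Bloom filter: log2(e)*n*log2(1/eps).

    Reference point only: a static filter has no sliding-window support.
    """
    if n <= 0 or not (0.0 < eps < 1.0):
        raise ValueError("need n > 0 and 0 < eps < 1")
    return math.log2(math.e) * n * math.log2(1 / eps)


if __name__ == "__main__":
    n, eps = 1_000_000, 0.01  # illustrative values, not from the paper
    print(f"bound from abstract: {cebf_bits(n, eps) / n:.2f} bits per element")
    print(f"static Bloom filter: {static_bloom_bits(n, eps) / n:.2f} bits per element")
```

With a window of one million elements and a 1% false positive rate, the bound works out to roughly 22 bits per window element under these assumptions.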
