Quotient hash tables: efficiently detecting duplicates in streaming data

This article presents the Quotient Hash Table (QHT), a new data structure for duplicate detection in unbounded streams. QHTs stem from a corrected analysis of streaming quotient filters (SQFs), which yields a 33% reduction in memory usage for equal performance. We provide a new and thorough analysis of both algorithms, whose results are also of interest for other existing constructions. We also introduce an optimised version of our new data structure, dubbed Queued QHT with Duplicates (QQHTD). We show in a benchmark that QHT and QQHTD are both more efficient and faster than any other filter from the literature, by a large margin. Finally, we discuss the effect of adversarial inputs on hash-based duplicate filters similar to QHT.
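To make the general idea of a hash-based duplicate filter concrete, the following is a minimal sketch and not the authors' construction. It assumes a layout that the abstract does not spell out: each element is hashed to a bucket index and a short fingerprint, each bucket holds a small fixed number of fingerprints, and the oldest fingerprint is dropped when a bucket is full. The class name `FingerprintDuplicateFilter`, its parameters, and the use of BLAKE2b are all illustrative choices.

```python
import hashlib
from collections import deque


class FingerprintDuplicateFilter:
    """Illustrative bounded-memory duplicate filter (hypothetical, not the paper's QHT)."""

    def __init__(self, n_buckets=1024, cells_per_bucket=4, fingerprint_bits=16):
        self.n_buckets = n_buckets
        self.fingerprint_bits = fingerprint_bits
        # Each bucket is a bounded queue: appending to a full deque
        # silently discards the oldest fingerprint (assumed eviction policy).
        self.buckets = [deque(maxlen=cells_per_bucket) for _ in range(n_buckets)]

    def _hash(self, item: bytes):
        # Split one 128-bit digest into a bucket index and a fingerprint.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.n_buckets
        fingerprint = int.from_bytes(digest[8:], "big") % (1 << self.fingerprint_bits)
        return bucket, fingerprint

    def seen(self, item: bytes) -> bool:
        """Report whether item looks like a duplicate, then record it if not."""
        bucket, fingerprint = self._hash(item)
        cells = self.buckets[bucket]
        if fingerprint in cells:
            return True  # may be a false positive if two items share a fingerprint
        cells.append(fingerprint)
        return False


if __name__ == "__main__":
    stream = [b"alice", b"bob", b"alice", b"carol", b"bob"]
    flt = FingerprintDuplicateFilter()
    for element in stream:
        print(element, "duplicate" if flt.seen(element) else "new")
```

Because memory is fixed while the stream is unbounded, a filter of this kind makes both kinds of error: false positives when two distinct elements share a bucket and fingerprint, and false negatives when an element's fingerprint has been evicted before it reappears.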
