Finding duplicates in a data stream

Given a data stream of length n over an alphabet [m] where n > m, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses O((log m)^3) space. This answers a question of Muthukrishnan [Mut05] and Tarui [Tar07], who asked if this problem could be solved using sub-linear space and one pass over the input. Our algorithm solves the more general problem of finding a positive frequency element in a stream given by frequency updates where the sum of all frequencies is positive. Our main tool is an Isolation Lemma that reduces this problem to the task of detecting and identifying a Dictatorial variable in a Boolean halfspace. We present various relaxations of the condition n > m, under which one can find duplicates efficiently.
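
To make the link between the duplicate problem and the frequency-update formulation concrete, here is a minimal Python sketch. It is not the algorithm of this paper: it keeps a full frequency table, so it uses linear space, and serves only as a reference for what the small-space algorithm must output. The function names (`to_frequency_updates`, `positive_frequency_element`, `find_duplicate`) are hypothetical, and the reduction shown (charge -1 to every symbol of [m], then +1 per stream item) is one natural way to obtain a stream whose frequencies sum to n - m > 0; it is assumed here rather than quoted from the paper.

```python
from collections import Counter


def to_frequency_updates(stream, m):
    """Turn a stream over [m] = {1, ..., m} into (item, delta) updates.

    Every symbol first receives -1, then each stream item receives +1,
    so the final frequency of a symbol is (number of occurrences) - 1
    and the frequencies sum to n - m.
    """
    for symbol in range(1, m + 1):
        yield symbol, -1
    for item in stream:
        yield item, +1


def positive_frequency_element(updates):
    """Naive linear-space reference (NOT the paper's small-space method).

    Returns some item whose net frequency is positive, or None.
    """
    freq = Counter()
    for item, delta in updates:
        freq[item] += delta
    for item, f in freq.items():
        if f > 0:
            return item
    return None


def find_duplicate(stream, m):
    """A symbol with positive net frequency occurs at least twice."""
    return positive_frequency_element(to_frequency_updates(stream, m))
```

When n > m the frequencies sum to a positive number, so at least one symbol has positive frequency, and any such symbol appears at least twice in the original stream. For example, `find_duplicate([3, 1, 4, 1, 2], 4)` returns `1`.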

[1] S. Muthukrishnan, et al. Data streams: algorithms and applications, 2005, SODA '03.

[2] Oded Goldreich, et al. On the power of two-point based sampling, 1989, J. Complex.

[3] Subhash Khot, et al. Near-optimal lower bounds on the multi-party communication complexity of set disjointness, 2003, 18th IEEE Annual Conference on Computational Complexity, 2003. Proceedings.

[4] Noam Nisan, et al. Pseudorandom generators for space-bounded computation, 1992, Comb.

[5] Raymond J. Mooney, et al. Adaptive duplicate detection using learnable string similarity measures, 2003, KDD '03.

[6] Bala Kalyanasundaram, et al. The Probabilistic Communication Complexity of Set Intersection, 1992, SIAM J. Discret. Math.

[7] Noga Alon, et al. The space complexity of approximating the frequency moments, 1996, STOC '96.

[8] Jun Tarui, et al. Finding a Duplicate and a Missing Item in a Stream, 2007, TAMC.

[9] E. Kushilevitz, et al. Communication Complexity: Basics, 1996.

[10] Noga Alon, et al. The Space Complexity of Approximating the Frequency Moments, 1999.

[11] Alexander A. Razborov, et al. On the Distributional Complexity of Disjointness, 1992, Theor. Comput. Sci.

[12] Fan Deng, et al. Approximately detecting duplicates for streaming data using stable bloom filters, 2006, SIGMOD Conference.

[13] Rajeev Motwani, et al. Robust identification of fuzzy duplicates, 2005, 21st International Conference on Data Engineering (ICDE'05).

[14] Andrei Z. Broder, et al. Efficient URL caching for world wide web crawling, 2003, WWW '03.

[15] Divyakant Agrawal, et al. Duplicate detection in click streams, 2005, WWW '05.