Finding needles in a hay stream: On persistent item lookup in data streams

Abstract: In a data stream, an ordered sequence of data items, persistent items are those that keep occurring over a long timespan. Persistent items do not necessarily occur more frequently than ordinary ones, but they typically convey more valuable information. Persistent item lookup, the task of identifying all persistent items, is a pivotal building block in many computing and network systems. In this paper, we devise a generic persistent item lookup algorithm that supports high-speed, high-accuracy lookup at limited memory cost. Our design rests on two key techniques. First, the algorithm attempts to record only the persistent items seen so far, judged from the currently available information about the stream, which significantly reduces memory overhead, especially for real-life, highly skewed data streams. Second, the algorithm balances the recording load in both the time and space domains: in the time domain, we partition persistent items into approximately equal-size subsets and record only one subset in each epoch; in the space domain, we apply a state-of-the-art load balancing technique to distribute recorded items evenly across the on-die memory. By integrating these components, we obtain a persistent item lookup algorithm that outperforms existing solutions in a wide range of practical settings.
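
To make the distinction between persistent and frequent items concrete, the following is a minimal sketch of a naive exact baseline, not the algorithm proposed in the paper. It assumes that the stream is cut into fixed-length epochs and that an item counts as persistent when it appears in at least `threshold` distinct epochs; the names `persistent_items`, `epoch_len`, and `threshold` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set


def persistent_items(
    stream: Iterable[Hashable], epoch_len: int, threshold: int
) -> Set[Hashable]:
    """Return the items that occur in at least `threshold` distinct epochs.

    Assumption (not from the paper): an epoch is a run of `epoch_len`
    consecutive stream items, and persistence is the number of distinct
    epochs in which an item appears.
    """
    epochs_seen: Dict[Hashable, int] = defaultdict(int)  # item -> epoch count
    seen_this_epoch: Set[Hashable] = set()

    for i, item in enumerate(stream):
        # A new epoch starts every `epoch_len` items.
        if i > 0 and i % epoch_len == 0:
            seen_this_epoch.clear()
        # Count each item at most once per epoch.
        if item not in seen_this_epoch:
            seen_this_epoch.add(item)
            epochs_seen[item] += 1

    return {item for item, n in epochs_seen.items() if n >= threshold}


# Three epochs of four items each. "b" occurs five times but only in two
# epochs; "a" occurs three times, once in every epoch.
stream = ["a", "b", "b", "b",
          "a", "c", "b", "b",
          "a", "d", "d", "d"]
print(persistent_items(stream, epoch_len=4, threshold=3))  # prints {'a'}
```

In this toy stream the less frequent item `a` is the only persistent one. Note that the baseline keeps a counter for every distinct item it has ever seen, which is exactly the kind of memory overhead that recording only likely-persistent items is meant to avoid.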
