LogLog Filter: Filtering Cold Items within a Large Range over High Speed Data Streams

Many real-world datasets arrive as data streams, and processing these streams is fundamental to applications such as anomaly detection. In this paper, we study three problems: computing item frequencies, finding top-k hot items, and detecting heavy changes. The widely used sketches for these tasks consume a large amount of memory, and their performance is easily degraded by the unbalanced distribution of data streams. To address this issue, the Cold Filter (CF) method was proposed to separate cold items from hot items and to record the frequencies of hot items in a separate structure. However, CF has a small filter range and is effective only for filtering cold items with small frequencies. In some real-world applications, the frequencies of cold items may reach hundreds or even tens of thousands. To address these challenges, we exploit the "LogLog" structure and develop a memory-efficient method, LogLog Filter (LLF), to accurately estimate the above three metrics. LLF builds a register array in which each register approximately counts the sum of the frequencies of the items hashed into it. Our method substantially enlarges the filter range of CF while using fewer bits: it requires only 4 bits to filter cold items with frequencies up to $2^{2^4}$. We conduct extensive experiments on real-world and synthetic datasets, and the experimental results demonstrate the efficiency and effectiveness of our method.
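
To make the register-array idea concrete, the following is a minimal sketch, not the LLF algorithm itself. The abstract does not give the update rule, so this sketch assumes a Morris-style probabilistic counter: a register holding value r is incremented with probability 2^-r, and 2^r - 1 is used as the estimate of the summed frequency. The class name `LogLogRegisterArray`, its parameters, and the CRC32-based hash are all hypothetical choices made for illustration.

```python
import random
import zlib


class LogLogRegisterArray:
    """Hypothetical array of small registers that count on a log scale.

    Assumption: a Morris-style update rule stands in for the LLF update,
    which the abstract does not specify.
    """

    def __init__(self, num_registers=1024, bits=4, seed=0):
        self.m = num_registers
        self.max_value = (1 << bits) - 1      # 15 when bits == 4
        self.registers = [0] * num_registers
        self.rng = random.Random(seed)        # seeded for reproducibility

    def _index(self, item):
        # Map an item to a register; CRC32 is an illustrative hash only.
        return zlib.crc32(str(item).encode("utf-8")) % self.m

    def insert(self, item):
        # Increment the register with probability 2^-r, so the stored
        # value grows roughly like log2 of the number of insertions.
        i = self._index(item)
        r = self.registers[i]
        if r < self.max_value and self.rng.random() < 2.0 ** (-r):
            self.registers[i] = r + 1

    def estimate(self, item):
        # Estimate of the total frequency of all items sharing this register.
        return (1 << self.registers[self._index(item)]) - 1

    def is_saturated(self, item):
        # A saturated register means the item has left the filter range;
        # a filter would hand such items to a separate hot-item structure.
        return self.registers[self._index(item)] == self.max_value
```

Because items that hash to the same register share one counter, `estimate` reflects the sum of their frequencies rather than the frequency of a single item. Under this assumed rule a 4-bit register saturates at 15, which corresponds to an estimate of 2^15 - 1; the exact range reported for LLF depends on its actual update rule.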
