Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing

Approximate stream processing algorithms, such as the Count-Min sketch and Space-Saving, support numerous applications in databases, storage systems, networking, and other domains. However, the unbalanced distribution of real data streams poses great challenges to existing algorithms. To enhance these algorithms, we propose a meta-framework, called Cold Filter (CF), that enables faster and more accurate stream processing. Unlike existing filters, which mainly focus on hot items, our filter captures cold items in the first stage and hot items in the second stage. Moreover, existing filters require two-direction communication, with frequent exchanges between the two stages, whereas our filter is one-direction: each item enters one stage at most once. Our filter can accurately estimate both cold and hot items, which makes it generic enough to apply to many stream processing tasks. To illustrate its benefits, we deploy it on three typical stream processing tasks; experimental results show speed improvements of up to 4.7 times and accuracy improvements of up to 51 times. All source code is publicly available on GitHub.
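The two-stage, one-direction idea can be illustrated with a minimal sketch. It is not the released implementation: the class name `ColdFilterSketch`, the parameters `width`, `depth`, and `threshold`, the conservative-update rule in the first stage, and the exact `Counter` standing in for the second-stage algorithm (which would be a Count-Min sketch, Space-Saving, or similar) are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


class ColdFilterSketch:
    """Illustrative two-stage counter; an assumption-laden sketch only."""

    def __init__(self, width=1024, depth=3, threshold=15):
        self.width = width          # number of small stage-1 counters (assumed)
        self.depth = depth          # hash functions per item (assumed)
        self.threshold = threshold  # cap on each stage-1 counter (assumed)
        self.stage1 = [0] * width   # stage 1: absorbs cold items
        self.stage2 = Counter()     # stage 2: stand-in for any hot-item algorithm

    def _slots(self, item):
        # Derive `depth` counter positions from seeded BLAKE2b digests.
        slots = []
        for seed in range(self.depth):
            digest = hashlib.blake2b(
                f"{seed}:{item}".encode(), digest_size=8
            ).digest()
            slots.append(int.from_bytes(digest, "big") % self.width)
        return slots

    def insert(self, item):
        slots = self._slots(item)
        low = min(self.stage1[s] for s in slots)
        if low < self.threshold:
            # Still cold: raise only the smallest counters by one.
            for s in slots:
                if self.stage1[s] == low:
                    self.stage1[s] += 1
        else:
            # Stage 1 is saturated for this item: forward it to stage 2.
            # Nothing ever flows back from stage 2 to stage 1.
            self.stage2[item] += 1

    def estimate(self, item):
        low = min(self.stage1[s] for s in self._slots(item))
        if low < self.threshold:
            return low  # cold item: answered by stage 1 alone
        return self.threshold + self.stage2[item]
```

In this sketch an item that occurs fewer than `threshold` times is normally handled entirely by the first stage, so the second stage sees only the occurrences of items that have already proven to be frequent. Estimates can exceed the true count when items collide in the first stage, but they never fall below it.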
