Timely Reporting of Heavy Hitters using External Memory

Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after its T-th occurrence is seen, where T = φN, which is the point at which it becomes a heavy hitter. We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). In-RAM heavy-hitters algorithms therefore typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ~100K observations per second.
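To make the reporting requirement concrete, the following is a minimal sketch of the TED problem with exact in-RAM counting and zero reporting delay. It is not the external-memory data structure described above; it is only the straightforward baseline whose space grows with the number of distinct items, which is what motivates moving to external memory. The function name `timely_heavy_hitters` and its parameters are hypothetical, chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, Tuple


def timely_heavy_hitters(
    stream: Iterable[Hashable], threshold: int
) -> Iterator[Tuple[int, Hashable]]:
    """Yield (position, item) at the moment an item reaches `threshold` occurrences.

    `threshold` plays the role of T = phi * N. Positions are 1-indexed.
    Illustrative baseline only: it keeps one exact counter per distinct item,
    so its memory use is linear in the number of distinct items.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")

    counts: Dict[Hashable, int] = defaultdict(int)
    for position, item in enumerate(stream, start=1):
        counts[item] += 1
        # Report exactly once, on the T-th occurrence: no false positives
        # (the count is exact) and no false negatives (every item is counted).
        if counts[item] == threshold:
            yield position, item


if __name__ == "__main__":
    observations = ["a", "b", "a", "c", "a", "b", "a"]
    for position, item in timely_heavy_hitters(observations, threshold=3):
        print(f"item {item!r} reported at position {position}")
```

In this example, `"a"` occurs at positions 1, 3, 5, and 7, so with a threshold of 3 it is reported once, at position 5, while `"b"` and `"c"` never reach the threshold. A structure that tolerates a bounded reporting delay would be allowed to emit the same report somewhat later than position 5.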
