Paper Information - Timely Reporting of Heavy Hitters using External Memory

Timely Reporting of Heavy Hitters using External Memory

Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T-th occurrence, where T = φN (the point at which it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ~100K observations per second.
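To make the reporting condition concrete, the following is a minimal in-RAM sketch of exact timely reporting: it keeps one counter per distinct item and reports an item at the moment of its T-th occurrence. It is not the external-memory data structure described in the abstract. It only illustrates the TED condition with zero delay, and because it stores a counter for every distinct item it is the kind of solution that needs space linear in the stream. The function name `timely_heavy_hitters` and the sample stream are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, Tuple


def timely_heavy_hitters(
    stream: Iterable[Hashable], threshold: int
) -> Iterator[Tuple[int, Hashable]]:
    """Yield (position, item) when an item reaches its threshold-th occurrence.

    Hypothetical illustration only: exact counts held in RAM, so there are no
    false positives or false negatives, at the cost of one counter per
    distinct item seen so far.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    counts: Dict[Hashable, int] = defaultdict(int)
    for position, item in enumerate(stream):
        counts[item] += 1
        # Report exactly once, at the T-th occurrence (zero reporting delay).
        if counts[item] == threshold:
            yield position, item


# Sample stream of N = 15 observations with reporting threshold T = 4.
events = list("abacabadabacaba")
reports = list(timely_heavy_hitters(events, threshold=4))
print(reports)  # [(6, 'a'), (13, 'b')]
```

In this sample, `a` is reported at position 6 and `b` at position 13 (0-indexed), while `c` and `d` never reach the threshold and are never reported.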
