Finding Significant Items in Data Streams

Finding the top-k frequent items is a long-standing problem in databases. Finding the top-k persistent items is a newer problem that has attracted increasing attention in recent years. In practice, users often want to know which items are significant, i.e., not only frequent but also persistent. No prior work addresses both problems at the same time. Moreover, on high-speed data streams, existing algorithms cannot achieve high accuracy when memory is tight. In this paper, we define a new problem, finding the top-k significant items, and propose a novel algorithm, LTC, to address it. LTC includes two key techniques: Long-tail Replacement and a modified CLOCK algorithm. We prove theoretically that LTC has no overestimation error, and we derive its correct rate and error bound. We conduct extensive experiments on three real datasets. The results show that LTC achieves between 300 and 10^8 times, and on average 10^5 times, higher accuracy than other related algorithms.
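
To make the notion of a significant item concrete, the following is a minimal sketch of an exact, non-streaming baseline. It is not LTC and uses neither Long-tail Replacement nor CLOCK. It assumes, for illustration only, that the stream is divided into numbered periods, that persistence is the number of distinct periods in which an item appears, and that significance is a weighted sum of frequency and persistence. The function name `top_k_significant` and the weights `alpha` and `beta` are hypothetical and are not taken from the abstract.

```python
from collections import Counter, defaultdict


def top_k_significant(stream, k, alpha=1.0, beta=1.0):
    """Return the k items with the highest assumed significance score.

    stream: iterable of (period_index, item) pairs.
    Assumed score (illustrative, not from the abstract):
        alpha * frequency + beta * persistence
    where persistence is the number of distinct periods containing the item.
    """
    frequency = Counter()             # total occurrences of each item
    periods_seen = defaultdict(set)   # distinct periods in which each item appears

    for period, item in stream:
        frequency[item] += 1
        periods_seen[item].add(period)

    scores = {
        item: alpha * frequency[item] + beta * len(periods_seen[item])
        for item in frequency
    }
    # Highest score first; ties are broken by item order, so items must be comparable.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:k]


# 'a' is frequent but appears in one period; 'b' is slightly less frequent but persistent.
example_stream = [
    (0, "a"), (0, "a"), (0, "a"), (0, "a"),
    (1, "b"), (2, "b"), (3, "b"),
    (0, "c"),
]
top_two = top_k_significant(example_stream, k=2)
```

This baseline keeps exact state for every distinct item, so its memory grows with the number of distinct items. That cost is what a compact structure for tight-memory, high-speed streams is meant to avoid.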
