Finding Persistent Items in Data Streams

Frequent item mining, which deals with finding items that occur frequently in a given data stream over a period of time, is one of the most heavily studied problems in data stream mining. A generalization of frequent item mining is persistent item mining: a persistent item, unlike a frequent item, does not necessarily occur more frequently than other items over a short period of time, but rather persists and occurs more frequently over a long period of time. To the best of our knowledge, there is no prior work on mining persistent items in a data stream. In this paper, we address the fundamental problem of finding persistent items in a given data stream during a given period of time at any given observation point. We propose a novel scheme, PIE, that can accurately identify each persistent item while achieving any desired false negative rate (FNR) and using a very small amount of memory. The key idea of PIE is that it uses Raptor codes to encode the ID of each item that appears at the observation point during a measurement period, and stores only a few bits of the encoded ID in the memory of that observation point for that measurement period. A persistent item occurs in enough measurement periods that enough encoded bits of its ID can be retrieved from the observation point to decode them correctly and recover the ID. We implemented and extensively evaluated PIE using three real network traffic traces and compared its performance with two adapted prior schemes. Our results show that PIE not only achieves the desired FNR in every scenario, but its FNR is also, on average, 19.5 times smaller than that of the best adapted prior art.
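The encode-then-accumulate idea can be illustrated with a toy sketch. This is not PIE itself: it swaps Raptor codes for a much simpler random linear fountain code over GF(2), stores exactly one encoded bit per measurement period, and ignores hashing, collisions between items, and the layout of memory at the observation point. The names `ID_BITS`, `mask_for_period`, `encode_bit`, and `decode` are hypothetical and chosen only for this example.

```python
import random

ID_BITS = 16  # assumed ID width for this toy example


def mask_for_period(period: int) -> int:
    """Pseudo-random nonzero mask choosing which ID bits are XORed in a period."""
    rng = random.Random(period)  # seeded so encoder and decoder agree on the mask
    return rng.randrange(1, 1 << ID_BITS)


def parity(value: int) -> int:
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(value).count("1") & 1


def encode_bit(item_id: int, period: int) -> int:
    """The single encoded bit stored for this item in this measurement period."""
    return parity(item_id & mask_for_period(period))


def decode(observations):
    """Recover the ID from (period, bit) pairs by Gaussian elimination over GF(2).

    Returns None when the collected bits do not determine the ID uniquely,
    which is what happens for an item seen in too few periods.
    """
    pivots = {}  # highest set bit of a row -> (mask, bit)
    for period, bit in observations:
        mask = mask_for_period(period)
        # Eliminate existing pivot positions, highest first.
        for p in sorted(pivots, reverse=True):
            if (mask >> p) & 1:
                pivot_mask, pivot_bit = pivots[p]
                mask ^= pivot_mask
                bit ^= pivot_bit
        if mask:  # the equation is linearly independent of earlier ones
            pivots[mask.bit_length() - 1] = (mask, bit)
    if len(pivots) < ID_BITS:
        return None
    # Back-substitute from the lowest bit upward.
    item_id = 0
    for p in range(ID_BITS):
        mask, bit = pivots[p]
        lower = mask ^ (1 << p)  # remaining terms involve only solved bits
        item_id |= (bit ^ parity(item_id & lower)) << p
    return item_id


if __name__ == "__main__":
    persistent = [(t, encode_bit(0xBEEF, t)) for t in range(40)]
    transient = persistent[:5]
    print(decode(persistent))  # ID is recovered once enough periods contribute
    print(decode(transient))   # too few encoded bits to decode
```

In this sketch an item seen in only a few periods contributes fewer independent equations than `ID_BITS`, so decoding fails, while an item seen in many periods almost surely contributes enough to recover its ID.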
