On Frequency Estimation and Detection of Frequent Items in Time Faded Streams

We address the problem of detecting frequent items in a stream under the constraint that items are weighted and recent items must weigh more than older ones. This problem arises naturally in a wide class of applications in which recent data is considered more useful and valuable than older, stale data. The weight assigned to an item is therefore a function of its arrival timestamp. As a consequence, whereas traditional frequent item mining applications estimate frequency counts, here we are required to estimate decayed counts. Such applications are said to work in the time fading model. Two sketch-based algorithms for processing time-decayed streams were published independently near the end of 2016. The Filtered Space Saving with Quasi-Heap (FSSQ) algorithm uses, besides a sketch, an additional data structure called a quasi-heap to maintain frequent items. Forward Decay Count-Min Space Saving (FDCMSS), our algorithm, combines key ideas borrowed from forward decay, the Count-Min sketch and the Space Saving algorithm. It therefore makes sense to compare and contrast the two algorithms in order to fully understand their strengths and weaknesses. We show, through extensive experimental results, that FSSQ is better at detecting frequent items than at frequency estimation. Its quasi-heap data structure slows the algorithm down owing to the large number of maintenance operations it requires, so FSSQ may not be able to cope with high-speed data streams. FDCMSS is better suited to frequency estimation; moreover, it is extremely fast and can be used for high-speed data streams and for the detection of frequent items as well, since its recall is always greater than 99%, even when using an extremely small amount of space. FDCMSS therefore proves to be a good overall choice when recall, precision, average relative error and speed are considered jointly.
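To make the notion of a decayed count concrete, the following is a minimal Python sketch, not the FDCMSS implementation evaluated in the paper. It assumes exponential decay with rate `lam` and a landmark time `landmark`, and stores forward-decayed weights in the cells of a small Count-Min sketch; the Space Saving component is omitted. The class name, method names, parameters and the salted use of Python's built-in `hash` are all illustrative assumptions.

```python
import math
import random


def exact_decayed_count(timestamps, now, lam):
    """Exact exponentially decayed count of one item, for reference."""
    return sum(math.exp(-lam * (now - t)) for t in timestamps)


class ForwardDecayCountMin:
    """Illustrative Count-Min sketch whose cells hold forward-decayed weights."""

    def __init__(self, width, depth, lam, landmark=0.0, seed=0):
        self.width = width
        self.depth = depth
        self.lam = lam            # decay rate (exponential decay is assumed)
        self.landmark = landmark  # landmark time, no later than any arrival
        rng = random.Random(seed)
        # One salt per row. Hashing a (salt, item) tuple stands in for the
        # pairwise-independent hash functions a real Count-Min sketch needs.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.cells = [[0.0] * width for _ in range(depth)]
        self.total = 0.0          # sum of all forward weights seen so far

    def _g(self, t):
        # Forward weight: grows with the arrival time and is normalised only
        # at query time. It can overflow when t - landmark becomes large.
        return math.exp(self.lam * (t - self.landmark))

    def _col(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def update(self, item, t):
        """Record one occurrence of item arriving at time t."""
        w = self._g(t)
        self.total += w
        for r in range(self.depth):
            self.cells[r][self._col(r, item)] += w

    def estimate(self, item, now):
        """Estimated decayed count of item at time now (may overestimate)."""
        raw = min(self.cells[r][self._col(r, item)] for r in range(self.depth))
        return raw / self._g(now)

    def decayed_total(self, now):
        """Decayed count of the whole stream at time now."""
        return self.total / self._g(now)


# Item 1 arrives ten times at t = 0..9; item 2 arrives once at t = 10.
# With a half-life of one time unit the single recent arrival outweighs
# the ten older ones.
sketch = ForwardDecayCountMin(width=64, depth=4, lam=math.log(2))
for t in range(10):
    sketch.update(1, float(t))
sketch.update(2, 10.0)
```

In the time fading model, an item would then be reported as frequent at time `now` when `sketch.estimate(item, now)` exceeds a chosen fraction of `sketch.decayed_total(now)`. Because hash collisions only add weight to a cell, the estimate never falls below the exact decayed count, up to floating-point rounding.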
