Mining frequent items in the time fading model

We present FDCMSS, a new sketch-based algorithm for mining frequent items in data streams. The algorithm combines key ideas borrowed from forward decay, the Count-Min sketch and the Space Saving algorithm. It works in the time fading model, mining data streams according to the cash register model. We formally prove its correctness and validate it experimentally on synthetic data drawn from a Zipf distribution and on real datasets. We compare its performance and error against λ-HCount, an algorithm recently proposed by Chen and Mei. Extensive experimental results show that FDCMSS outperforms λ-HCount with regard to speed, space used, precision attained and error committed on both synthetic and real datasets.
