Mining frequent items in data stream using time fading model

We investigate the problem of finding frequent items in a continuous data stream and present an algorithm, λ-HCount, for computing frequency counts of stream data based on a time fading model. The algorithm uses r hash functions to estimate the density values of stream data items. To emphasize the importance of recent data items, a time fading factor is used. For a given error bound, our algorithm detects approximate frequent items with a certain probability using a limited amount of memory. The memory requirement depends only on the number of different data items and the number of hash functions used. Experimental results on synthetic and real data sets show that our algorithm outperforms other methods in terms of accuracy, memory requirement, and processing speed.
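The idea of combining r hash functions with a time fading factor can be illustrated with a minimal sketch. This is not the algorithm from the paper: it assumes a count-min-style layout (r rows of m counters, estimate taken as the minimum over rows) and exponential decay applied lazily per counter. The class name `FadingHashCounter`, the parameters `r`, `m`, `fading`, and the `is_frequent` helper are all hypothetical names chosen for illustration.

```python
import random


class FadingHashCounter:
    """Illustrative sketch only: r hashed counter rows with exponential fading.

    Assumption: a count-min-style structure, not the paper's exact method.
    """

    def __init__(self, r, m, fading, seed=0):
        if not 0.0 < fading <= 1.0:
            raise ValueError("fading factor must be in (0, 1]")
        self.r, self.m, self.fading = r, m, fading
        rng = random.Random(seed)
        # One random salt per row stands in for r independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(r)]
        self.density = [[0.0] * m for _ in range(r)]
        self.last_seen = [[0] * m for _ in range(r)]
        self.now = 0        # logical time: one tick per arriving item
        self.total = 0.0    # decayed count of all items seen so far

    def _cell(self, row, item):
        return hash((self.salts[row], item)) % self.m

    def update(self, item):
        """Record one arrival of `item`, fading older contributions."""
        self.now += 1
        self.total = self.total * self.fading + 1.0
        for row in range(self.r):
            col = self._cell(row, item)
            age = self.now - self.last_seen[row][col]
            # Lazy decay: fade the counter only when it is touched.
            faded = self.density[row][col] * self.fading ** age
            self.density[row][col] = faded + 1.0
            self.last_seen[row][col] = self.now

    def estimate(self, item):
        """Return an upper-bound estimate of the decayed count of `item`."""
        best = float("inf")
        for row in range(self.r):
            col = self._cell(row, item)
            age = self.now - self.last_seen[row][col]
            best = min(best, self.density[row][col] * self.fading ** age)
        return best

    def is_frequent(self, item, support):
        """Report whether the estimate reaches `support` times the decayed total."""
        return self.estimate(item) >= support * self.total


counter = FadingHashCounter(r=4, m=1024, fading=0.5)
for x in [7, 7, 7]:
    counter.update(x)
print(counter.estimate(7))
```

Because hash collisions can only add weight to a counter, taking the minimum over the r rows never underestimates the decayed count; more rows lower the chance that every row collides for the same item. Memory in this sketch is fixed at r × m counters, which differs from the memory bound stated in the abstract.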
