Efficiently Discovering Recent Frequent Items in Data Streams

The problem of discovering frequent items in streaming data has attracted considerable attention. Although this problem has been studied extensively and several techniques have been proposed for its solution, these approaches treat all values of the data stream equally. Not all values are equally important, however: in several situations we are more interested in the values that have appeared recently in the stream than in older ones. In this paper, we address the problem of finding recent frequent items in a data stream given a small, bounded memory, and present novel algorithms for this purpose. We propose a basic algorithm that extends the functionality of existing approaches by monitoring item frequencies in recent windows. We then present an improved version of the algorithm that achieves significantly higher accuracy at no extra memory cost. Finally, we perform an extensive experimental evaluation and show that the proposed algorithms can efficiently identify the frequent items in ad hoc recent windows of a data stream.
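
To make the idea of monitoring item frequencies in recent windows under bounded memory concrete, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes the stream is cut into fixed-size blocks, keeps one small Misra-Gries summary per block, and discards the oldest block once a fixed number of blocks is held. The class name `RecentFrequentItems` and the parameters `block_size`, `max_blocks`, and `counters_per_block` are hypothetical names introduced only for illustration.

```python
from collections import deque


class RecentFrequentItems:
    """Illustrative sketch (an assumption, not the paper's method):
    one bounded Misra-Gries summary per fixed-size block of the stream."""

    def __init__(self, block_size, max_blocks, counters_per_block):
        self.block_size = block_size
        self.k = counters_per_block
        # A deque with maxlen silently drops the oldest summary when full,
        # so memory stays bounded by max_blocks * counters_per_block.
        self.blocks = deque(maxlen=max_blocks)
        self.current = {}
        self.seen_in_current = 0

    def add(self, item):
        counts = self.current
        if item in counts:
            counts[item] += 1
        elif len(counts) < self.k:
            counts[item] = 1
        else:
            # Misra-Gries step: decrement all counters, drop those at zero.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
        self.seen_in_current += 1
        if self.seen_in_current == self.block_size:
            # Close the block and start a fresh summary.
            self.blocks.append(counts)
            self.current = {}
            self.seen_in_current = 0

    def frequent(self, recent_blocks, min_count):
        """Return items whose summed counts over the open block and the
        last `recent_blocks` closed blocks are at least `min_count`."""
        closed = list(self.blocks)
        start = len(closed) - min(recent_blocks, len(closed))
        totals = dict(self.current)
        for summary in closed[start:]:
            for item, count in summary.items():
                totals[item] = totals.get(item, 0) + count
        return {i: c for i, c in totals.items() if c >= min_count}


tracker = RecentFrequentItems(block_size=4, max_blocks=3, counters_per_block=2)
for item in "aaaa" "aaac" "bbba" "bbcb":
    tracker.add(item)

recent = tracker.frequent(recent_blocks=2, min_count=4)
wider = tracker.frequent(recent_blocks=3, min_count=4)
```

In this toy stream the item `a` dominates early and `b` dominates late, so a query over the two most recent blocks reports only `b`, while a query over three blocks also reports `a`. Because Misra-Gries counters never overestimate, the summed counts are lower bounds on the true frequencies in the queried window.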
