Time-weighted counting for recently frequent pattern mining in data streams

How can we discover interesting patterns in time-evolving, high-speed data streams? How can we analyze such streams quickly and accurately with little space overhead? How can we guarantee that the discovered patterns are self-consistent? High-speed data streams have received increasing attention because of their wide range of applications, such as sensors, network traffic, and social networks. The most fundamental task on a data stream is frequent pattern mining, and in real applications it is especially important to focus on recency. In this paper, we develop two algorithms for discovering recently frequent patterns in data streams. First, we propose TwMinSwap to find the top-k recently frequent items in a data stream. TwMinSwap is a deterministic version of our motivating algorithm TwSample, which provides theoretical guarantees based on item sampling. TwMinSwap improves on TwSample in speed, accuracy, and memory usage. Both algorithms require only O(k) memory and no prior knowledge of the stream, such as its length or the number of distinct items it contains. Second, we propose TwMinSwap-Is to find the top-k recently frequent itemsets in a data stream. We focus in particular on keeping the discovered itemsets self-consistent, the most important property for reliable results, while using O(k) memory under the assumption of a constant itemset size. Through extensive experiments, we demonstrate that TwMinSwap outperforms all competitors in accuracy and memory usage while running fast. We also show that TwMinSwap-Is is more accurate than its competitor and discovers recently frequent itemsets of reasonably large size (at most 5–7, depending on the dataset). Using TwMinSwap and TwMinSwap-Is, we report interesting discoveries in real-world data streams, including the difference in trends between the winning and losing U.S. presidential candidates, and temporal human contact patterns.
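
To make the idea of time-weighted counting in O(k) memory concrete, the following is a minimal generic sketch. It is not the TwMinSwap algorithm, whose details the abstract does not give. The function name `decayed_top_k`, the exponential decay factor `alpha`, and the eviction rule are all assumptions made for illustration only.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def decayed_top_k(
    stream: Iterable[Hashable], k: int, alpha: float = 0.9
) -> List[Tuple[Hashable, float]]:
    """Track at most k items with exponentially decayed counts.

    Illustrative only: alpha (decay per arrival) and the eviction rule
    are assumptions, not the method described in the paper.
    """
    counts: Dict[Hashable, float] = {}
    for item in stream:
        # Age every stored count by one time step (O(k) work per arrival).
        for key in counts:
            counts[key] *= alpha
        if item in counts:
            counts[item] += 1.0
        elif len(counts) < k:
            counts[item] = 1.0
        else:
            # Table is full: replace the weakest entry only if its decayed
            # count has dropped below the weight of a fresh arrival (1.0).
            victim = min(counts, key=counts.get)
            if counts[victim] < 1.0:
                del counts[victim]
                counts[item] = 1.0
    # Report items from most to least recently frequent.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


stream = ["a"] * 5 + ["b"] * 3
recent = decayed_top_k(stream, k=1, alpha=0.5)   # strong decay favors "b"
no_decay = decayed_top_k(stream, k=1, alpha=1.0)  # plain counting keeps "a"
```

With strong decay the single counter ends up tracking the item that dominates the end of the stream, whereas without decay it keeps the item that is most frequent overall. This contrast is what separates recently frequent from plainly frequent items.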
