Finding frequent items in data streams using hierarchical information

Finding frequent items, or the top-k items, in a data stream is a basic mining task with a wide range of applications. Many algorithms have been proposed to improve the performance of this task, but little effort has been made to exploit the hierarchical information carried by items in a data stream. In this paper, we try to improve the accuracy of finding frequent items by using the hierarchical information in a taxonomy. To do so, we propose a strategy called Merge. Based on this strategy, we design and implement an algorithm named FISHMerge. To evaluate the algorithm, we propose three new measures and develop a hierarchical stream data generator. A comprehensive experimental study shows that, given the same amount of memory, FISHMerge is more accurate than algorithms that do not use hierarchical information. In addition, our algorithm provides some information about higher-level items.
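
The abstract does not spell out how Merge works, so the following is only an illustrative sketch of one way a taxonomy could be used under a fixed memory budget, not the FISHMerge algorithm itself. It assumes a hypothetical `parent` map from each item to its parent in the taxonomy and a hypothetical `capacity` on the number of counters; when the table overflows, the smallest non-root counter is folded into its parent, so the count is kept at a coarser level instead of being discarded.

```python
from collections import Counter


def hierarchical_counts(stream, parent, capacity):
    """Count stream items with at most `capacity` counters.

    Assumption (not from the paper): `parent` maps an item to its parent
    in the taxonomy; items missing from `parent` are treated as roots.
    `capacity` should be at least the number of root categories.
    """
    counts = Counter()
    for item in stream:
        counts[item] += 1
        # On overflow, fold the smallest non-root counter into its parent.
        while len(counts) > capacity:
            foldable = [x for x in counts if x in parent]
            if not foldable:
                break  # only roots remain; nothing more can be folded
            victim = min(foldable, key=lambda x: counts[x])
            moved = counts.pop(victim)
            counts[parent[victim]] += moved
    return counts


# Hypothetical two-level taxonomy and a short stream, for illustration only.
parent = {"cola": "drinks", "tea": "drinks", "juice": "drinks",
          "bread": "food", "rice": "food"}
stream = ["cola"] * 5 + ["tea", "bread", "juice", "rice", "cola"]
result = hierarchical_counts(stream, parent, capacity=3)
```

In this sketch the total of all counters always equals the number of items seen, because folding moves counts up the taxonomy rather than dropping them. The frequent leaf item keeps its own counter, while rare items survive only as contributions to their higher-level categories.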
