HIDS: a multifunctional generator of hierarchical data streams

Research on high-speed data streams requires large amounts of synthetic data. Researchers increasingly focus on hierarchical multi-dimensional data streams and data sets, which traditional synthetic data generators cannot produce. In this paper we propose a two-phase method for generating hierarchical multi-dimensional data streams: a tree-like structure is built first, and then an unlimited number of items, chosen among the tree leaves according to a distribution, are inserted into the stream. Our generator, HIDS, integrates all of the functions of existing data generators and can customize the tree structure according to users' requirements, producing structures such as equal-depth trees, equal-fan-out trees, balanced trees and different-fan-out trees. An experimental study using real data streams shows that HIDS can generate data streams tailored to specific applications.
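The two-phase idea can be illustrated with a minimal sketch. This is not the HIDS implementation: the function names, the choice of an equal-fan-out tree, and the Zipf-like leaf weights are all assumptions made for illustration, since the abstract does not specify which distributions or interfaces HIDS uses.

```python
import random
from itertools import islice


def build_equal_fanout_tree(depth, fan_out):
    """Phase 1 (hypothetical helper): build an equal-fan-out tree.

    Each leaf is returned as its root-to-leaf path, a tuple of child indices,
    so every prefix of a leaf identifies one of its ancestors.
    """
    leaves = [()]
    for _ in range(depth):
        leaves = [path + (child,) for path in leaves for child in range(fan_out)]
    return leaves


def zipf_weights(n, skew=1.0):
    """Assumed leaf distribution: weight of the leaf at rank r is 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, n + 1)]


def generate_stream(leaves, weights, seed=None):
    """Phase 2 (hypothetical helper): emit an unbounded stream of leaves.

    Each item is drawn independently from the leaves according to `weights`.
    """
    rng = random.Random(seed)
    while True:
        yield rng.choices(leaves, weights=weights, k=1)[0]


if __name__ == "__main__":
    leaves = build_equal_fanout_tree(depth=3, fan_out=2)
    stream = generate_stream(leaves, zipf_weights(len(leaves)), seed=42)
    for item in islice(stream, 5):
        print(item)
```

Because the stream is a generator, a consumer takes as many items as it needs, which mirrors the "unlimited number of items" in the description. Other tree shapes mentioned in the abstract, such as different-fan-out trees, would only change phase 1.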
