Hokusai - Sketching Streams in Real Time

We describe Hokusai, a real-time system that captures frequency information for streams of arbitrary sequences of symbols. The algorithm builds on the Count-Min sketch and exploits the fact that sketching is linear. It provides real-time statistics of arbitrary events, e.g. streams of queries as a function of time. We use a factorizing approximation to provide point estimates at arbitrary (time, item) combinations. Queries can be answered in constant time.
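
To make the linearity property concrete, the following is a minimal Python sketch of a Count-Min sketch; it is an illustration under stated assumptions, not the Hokusai implementation. The class name `CountMinSketch`, the default `width` and `depth`, and the choice of salted BLAKE2b as the hash family are all assumptions made for this example. Because every update is an addition to a counter, two sketches built with the same dimensions and hash functions can be summed cell by cell, and the result is identical to a sketch of the concatenated stream. A point query reads one counter per row, so its cost depends only on the depth, not on the stream length.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch; names and parameters are assumptions."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def query(self, item):
        # Collisions only inflate counters, so the minimum over the rows
        # is an upper bound on the true count.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )

    def merge(self, other):
        # Linearity: the sketch of two streams combined is the
        # cell-wise sum of their individual sketches.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [
                a + b for a, b in zip(self.table[row], other.table[row])
            ]
        return merged


# Two hypothetical time intervals of a query stream.
interval_1 = ["cat", "dog", "cat"]
interval_2 = ["cat", "fish"]

sketch_1 = CountMinSketch()
for symbol in interval_1:
    sketch_1.add(symbol)

sketch_2 = CountMinSketch()
for symbol in interval_2:
    sketch_2.add(symbol)

# Aggregating over time is just adding sketches.
both_intervals = sketch_1.merge(sketch_2)
```

Summing per-interval sketches in this way is what allows counts over longer time spans to be obtained without revisiting the raw stream.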
