Histogramming Data Streams with Fast Per-Item Processing

A vector $A$ of length $N$ can be approximately represented by a histogram $H$: write $[0,N)$ as the non-overlapping union of $B$ intervals $I_j$, assign a value $b_j$ to each $I_j$, and approximate $A_i$ by $H_i = b_j$ for $i \in I_j$. An optimal histogram representation $H_{opt}$ consists of the choices of $I_j$ and $b_j$ that minimize the sum-square-error $\|A - H\|_2^2 = \sum_i |A_i - H_i|^2$. Numerous applications in statistics, signal processing and databases rely on histograms; typically $B$ is (significantly) smaller than $N$ and, hence, representing $A$ by $H$ yields substantial compression. We give a deterministic algorithm that approximates $H_{opt}$ and outputs a histogram $H$ such that $\|A - H\|_2^2 \le (1 + \epsilon) \|A - H_{opt}\|_2^2$. Our algorithm considers the data items $A_0, A_1, \ldots$ in order, i.e., in one pass, spends processing time $O(1)$ per item, uses total space $B \cdot \mathrm{poly}(\log N, \log \|A\|, 1/\epsilon)$, and determines the histogram in time $\mathrm{poly}(B, \log N, \log \|A\|, 1/\epsilon)$. Our algorithm is suited to emerging applications where the signal is presented as a stream, the signal is very large, and the histogram must be constructed using space significantly smaller than the signal size. In particular, it is suited to high-performance settings where the per-item processing time must be minimized. Previous algorithms either used large space, i.e., $\Omega(N)$, or worked longer, i.e., $N \log^{\Omega(1)}(N)$ total time over the $N$ data items. Our algorithm is the first that simultaneously uses small space and runs fast, taking $O(1)$ worst-case time per item. In addition, our algorithm is quite simple.
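To make the objective concrete, the following is a minimal Python sketch of what an optimal histogram is, under stated assumptions. It is not the one-pass algorithm described above: it is the classic offline dynamic program, which stores the whole vector and takes O(N^2 B) time, i.e., exactly the large-space regime the abstract contrasts against. The function name `optimal_histogram` and the bucket tuple layout are hypothetical choices made for illustration. The sketch relies on the fact that, for a fixed interval, the value minimizing the sum-square-error is the mean of the entries in that interval.

```python
from itertools import accumulate


def optimal_histogram(a, num_buckets):
    """Offline DP for a minimum sum-square-error histogram (illustrative only).

    Returns (error, buckets), where buckets is a list of (start, end, value)
    triples describing half-open intervals [start, end).
    """
    n = len(a)
    if n == 0 or num_buckets <= 0:
        return 0.0, []
    b = min(num_buckets, n)  # more buckets than items never helps

    # Prefix sums of values and of squares give any bucket's error in O(1).
    s = [0.0] + list(accumulate(float(x) for x in a))
    q = [0.0] + list(accumulate(float(x) * float(x) for x in a))

    def bucket_error(lo, hi):
        # Error of one bucket over [lo, hi) when its value is the mean.
        total = s[hi] - s[lo]
        return max(0.0, (q[hi] - q[lo]) - total * total / (hi - lo))

    inf = float("inf")
    # err[k][i]: least error covering the prefix [0, i) with exactly k buckets.
    err = [[inf] * (n + 1) for _ in range(b + 1)]
    cut = [[0] * (n + 1) for _ in range(b + 1)]
    err[0][0] = 0.0
    for k in range(1, b + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # the last bucket is [j, i)
                cand = err[k - 1][j] + bucket_error(j, i)
                if cand < err[k][i]:
                    err[k][i] = cand
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the intervals.
    buckets = []
    i = n
    for k in range(b, 0, -1):
        j = cut[k][i]
        buckets.append((j, i, (s[i] - s[j]) / (i - j)))
        i = j
    buckets.reverse()
    return err[b][n], buckets
```

For example, `optimal_histogram([1, 1, 1, 5, 5, 5, 9, 9], 3)` recovers the three constant runs with zero error, while a single bucket over the same vector uses the overall mean 4.5 and incurs error 78. A (1 + ε)-approximate histogram in the sense of the abstract would be any B-bucket histogram whose error is at most (1 + ε) times the value this program returns.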
