Space Efficient Streaming Algorithms for the Maximum Error Histogram

We propose new algorithms for constructing maximum error (L∞) histograms in the data stream model. Our first algorithm (Min-Merge) achieves the following performance guarantee: using O(B) memory, it constructs a 2B-bucket histogram whose approximation error is at most the error of the optimal B-bucket histogram. Our second algorithm (Min-Increment) achieves a (1 + ε)-approximation of the optimal B-bucket histogram using O(ε^{-1} B log U) space, where U is the size of the domain for data values. The memory requirements of these algorithms are a significant improvement over the previous best schemes for constructing near-optimal histograms in the data stream model, making them well suited to data summary applications where memory is at a premium, such as wireless sensor networks. Our Min-Increment algorithm also extends to the sliding window model without any asymptotic increase in space. Finally, using synthetic and real-world data, we show that our algorithms are as space-efficient in practice as their theoretical analysis predicts: compared to the previous best algorithms, they require two or more orders of magnitude less memory for the same approximation error.
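
To make the Min-Merge guarantee concrete, the following is a minimal Python sketch of one way a 2B-bucket summary could be maintained in O(B) memory. The abstract does not spell out the procedure, so the greedy rule used here (open a new bucket for every arriving value, then merge the adjacent pair whose merged error is smallest whenever more than 2B buckets exist) is an assumption, as are the names `min_merge`, `bucket_error`, and `histogram_error`. The sketch also assumes each bucket is represented by the midpoint of its value range, so its L∞ error is (max - min) / 2.

```python
from typing import Iterable, List, Tuple

# A bucket is (min value, max value, number of stream items covered).
Bucket = Tuple[float, float, int]


def bucket_error(lo: float, hi: float) -> float:
    """L-infinity error of a bucket summarized by the midpoint of [lo, hi]."""
    return (hi - lo) / 2.0


def min_merge(stream: Iterable[float], num_buckets: int) -> List[Bucket]:
    """Hypothetical greedy sketch: keep at most 2 * num_buckets buckets."""
    limit = 2 * num_buckets
    buckets: List[Bucket] = []
    for x in stream:
        # Every new value starts in its own zero-error bucket.
        buckets.append((x, x, 1))
        if len(buckets) > limit:
            # Pick the adjacent pair whose merged bucket has the least error.
            def merged_error(i: int) -> float:
                lo = min(buckets[i][0], buckets[i + 1][0])
                hi = max(buckets[i][1], buckets[i + 1][1])
                return bucket_error(lo, hi)

            i = min(range(len(buckets) - 1), key=merged_error)
            a, b = buckets[i], buckets[i + 1]
            buckets[i:i + 2] = [(min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])]
    return buckets


def histogram_error(buckets: List[Bucket]) -> float:
    """Maximum error over all buckets of the histogram."""
    return max(bucket_error(lo, hi) for lo, hi, _ in buckets)


if __name__ == "__main__":
    data = [1, 1, 1, 10, 10, 10, 20, 20, 20]
    hist = min_merge(data, num_buckets=2)
    print(hist, histogram_error(hist))
```

The linear scan for the cheapest adjacent pair keeps the sketch short but costs O(B) time per item; a priority queue over adjacent pairs would be the natural choice if per-item update time matters.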
