A near-optimal algorithm for estimating the entropy of a stream

We describe a simple algorithm for approximating the empirical entropy of a stream of <i>m</i> values up to a multiplicative factor of (1+ε) using a single pass, <i>O</i>(ε<sup>−2</sup> log(δ<sup>−1</sup>) log <i>m</i>) words of space, and <i>O</i>(log ε<sup>−1</sup> + log log δ<sup>−1</sup> + log log <i>m</i>) processing time per item in the stream. Our algorithm is based upon a novel extension of a method introduced by Alon et al. [1999]. This improves over previous work on this problem. We show a space lower bound of Ω(ε<sup>−2</sup>/log<sup>2</sup>(ε<sup>−1</sup>)), demonstrating that our algorithm is near-optimal in terms of its dependence on ε. We show that generalizing to multiplicative approximation of the <i>k</i>th-order entropy requires close to linear space for <i>k</i> ≥ 1. In contrast, we show that additive approximation is possible in a single pass using only polylogarithmic space. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph.
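As a rough illustration of the sampling idea from Alon et al. that the abstract builds on, the following Python sketch shows one way it can be applied to empirical entropy. It is a minimal sketch under stated assumptions, not the algorithm of this paper: it omits the extension the abstract refers to and gives no (1+ε) guarantee on its own. It samples a uniformly random stream position, counts the occurrences <i>r</i> of the sampled item from that position onward, and returns <i>g</i>(<i>r</i>) − <i>g</i>(<i>r</i>−1), where <i>g</i>(<i>x</i>) = <i>x</i> log(<i>m</i>/<i>x</i>). The sum over positions telescopes, so the expectation equals the empirical entropy. All function names are hypothetical, and base-2 logarithms are an assumption.

```python
import math
import random
from collections import Counter


def exact_entropy(stream):
    """Empirical entropy (in bits) of a fully stored sequence, for reference."""
    m = len(stream)
    return sum((c / m) * math.log2(m / c) for c in Counter(stream).values())


def _g(x, m):
    """g(x) = x * log2(m / x), with g(x) taken to be 0 for x <= 0."""
    return x * math.log2(m / x) if x > 0 else 0.0


def estimate_from_position(stream, j):
    """Value of the basic estimator when position j (0-based) is the sample.

    Offline helper for illustration only; it needs the whole stream in memory.
    """
    m = len(stream)
    r = sum(1 for item in stream[j:] if item == stream[j])
    return _g(r, m) - _g(r - 1, m)


def streaming_estimate(stream, rng):
    """One pass: choose a uniform random position by reservoir sampling and
    count occurrences of the sampled item from that position onward."""
    sample, r, m = None, 0, 0
    for item in stream:
        m += 1
        if rng.random() < 1.0 / m:  # replace the sample with probability 1/m
            sample, r = item, 1
        elif item == sample:
            r += 1
    return _g(r, m) - _g(r - 1, m)


def averaged_estimate(stream, copies, seed=0):
    """Mean of several independent basic estimators (reduces variance)."""
    rng = random.Random(seed)
    return sum(streaming_estimate(stream, rng) for _ in range(copies)) / copies


if __name__ == "__main__":
    data = list("abracadabra")
    print("exact   :", exact_entropy(data))
    print("estimate:", averaged_estimate(data, copies=2000))
```

Each copy stores only the sampled item and two counters. The averaged estimate approaches the exact value as the number of copies grows, but how many copies are needed for a given accuracy is not addressed by this sketch.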

[1] Noga Alon, et al. The Space Complexity of Approximating the Frequency Moments, 1999.

[2] Ronitt Rubinfeld, et al. The complexity of approximating the entropy, 2002, Proceedings 17th IEEE Annual Conference on Computational Complexity.

[3] Sumit Ganguly, et al. Estimating Entropy over Data Streams, 2006, ESA.

[4] Prosenjit Bose, et al. Bounds for Frequency Estimation of Packet Streams, 2003, SIROCCO.

[5] Graham Cormode, et al. Space efficient mining of multigraph streams, 2005, PODS.

[6] Vyas Sekar, et al. Data streaming algorithms for estimating entropy of network traffic, 2006, SIGMETRICS '06/Performance '06.

[7] Krzysztof Onak, et al. Sketching and Streaming Entropy via Approximation Theory, 2008, 2008 49th Annual IEEE Symposium on Foundations of Computer Science.

[8] Ronitt Rubinfeld, et al. The complexity of approximating entropy, 2002, STOC '02.

[9] S. Muthukrishnan, et al. Estimating Entropy and Entropy Norm on Data Streams, 2006, STACS.

[10] David P. Woodruff, et al. Optimal approximations of the frequency moments of data streams, 2005, STOC '05.

[11] Sumit Ganguly, et al. Simpler algorithm for estimating frequency moments of data streams, 2006, SODA '06.

[12] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[13] Sudipto Guha, et al. Streaming and sublinear approximation of entropy and information distances, 2005, SODA '06.

[14] David P. Woodruff. Optimal space lower bounds for all frequency moments, 2004, SODA '04.

[15] David P. Woodruff, et al. Tight lower bounds for the distinct elements problem, 2003, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings.

[16] Gurmeet Singh Manku, et al. Approximate counts and quantiles over sliding windows, 2004, PODS.

[17] Ziv Bar-Yossef, et al. Reductions in streaming algorithms, with an application to counting triangles in graphs, 2002, SODA '02.

[18] Piotr Indyk, et al. A small approximately min-wise independent family of hash functions, 1999, SODA '99.

[19] Bernhard Plattner, et al. Entropy based worm and anomaly detection in fast IP networks, 2005, 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise (WETICE'05).

[20] Graham Cormode, et al. A near-optimal algorithm for computing the entropy of a stream, 2007, SODA '07.

[21] Luca Trevisan, et al. Counting Distinct Elements in a Data Stream, 2002, RANDOM.

[22] Piotr Indyk, et al. Maintaining stream statistics over sliding windows: (extended abstract), 2002, SODA '02.

[23] Zhi-Li Zhang, et al. Profiling internet backbone traffic: behavior models and applications, 2005, SIGCOMM '05.

[24] Piotr Indyk, et al. Maintaining Stream Statistics over Sliding Windows, 2002, SIAM J. Comput.

[25] Rajeev Motwani, et al. Sampling from a moving window over streaming data, 2002, SODA '02.

[26] Joan Feigenbaum, et al. On graph problems in a semi-streaming model, 2005, Theor. Comput. Sci.

[27] Noga Alon, et al. The space complexity of approximating the frequency moments, 1996, STOC '96.

[28] Graham Cormode, et al. An improved data stream summary: the count-min sketch and its applications, 2004, J. Algorithms.

[29] Jeffrey Scott Vitter, et al. Random sampling with a reservoir, 1985, TOMS.

[30] Jayadev Misra, et al. Finding Repeated Elements, 1982, Sci. Comput. Program.

[31] S. Muthukrishnan, et al. Estimating Entropy and Entropy Norm on Data Streams, 2006, Internet Math.

[32] Donald F. Towsley, et al. Detecting anomalies in network traffic using maximum entropy estimation, 2005, IMC '05.

[33] Srikanta Tirthapura, et al. Distributed Streams Algorithms for Sliding Windows, 2002, SPAA '02.

[34] Subhash Khot, et al. Near-optimal lower bounds on the multi-party communication complexity of set disjointness, 2003, 18th IEEE Annual Conference on Computational Complexity, 2003. Proceedings.