A near-optimal algorithm for computing the entropy of a stream

We describe a simple algorithm for approximating the empirical entropy of a stream of <i>m</i> values in a single pass, using <i>O</i>(ε<sup>-2</sup> log(δ<sup>-1</sup>) log <i>m</i>) words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of Ω(ε<sup>-2</sup> / log(ε<sup>-1</sup>)), meaning that our algorithm is near-optimal in terms of its dependency on ε. This improves over previous work on this problem [8, 13, 17, 5]. We show that generalizing to <i>k</i>th order entropy requires close to linear space for all <i>k</i> ≥ 1, and give additive approximations using our algorithm. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph.
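To make the quantity being approximated concrete, the following is a minimal sketch of the empirical entropy and of a basic single-pass estimator in the style of Alon, Matias, and Szegedy. It is an illustration under stated assumptions, not the paper's full algorithm: the function names are hypothetical, the stream is assumed non-empty, and the averaging step re-reads the stream once per copy for clarity, whereas a true one-pass version would keep all copies in parallel. The paper's refinements and its space analysis are not reproduced here.

```python
import math
import random
from collections import Counter


def exact_entropy(stream):
    """Empirical entropy H = sum_i (m_i / m) * log2(m / m_i).

    Baseline for comparison; needs space proportional to the number of
    distinct items, which is what a streaming algorithm tries to avoid.
    """
    m = len(stream)
    counts = Counter(stream)
    return sum((c / m) * math.log2(m / c) for c in counts.values())


def _f(r, m):
    """f(r) = r * log2(m / r), with the convention f(0) = 0."""
    return 0.0 if r == 0 else r * math.log2(m / r)


def basic_estimate(r, m):
    """Basic estimator X = f(r) - f(r - 1).

    If r is the number of occurrences of a uniformly sampled stream
    position's item from that position onward, the sum over positions
    telescopes, so E[X] equals the empirical entropy.
    """
    return _f(r, m) - _f(r - 1, m)


def one_pass_sample(stream, rng):
    """One pass: reservoir-sample a position and count later repeats.

    Returns (r, m), where m is the stream length seen so far.
    """
    sample, r, m = None, 0, 0
    for item in stream:
        m += 1
        if rng.random() < 1.0 / m:
            # Replace the sampled position with the current one.
            sample, r = item, 1
        elif item == sample:
            r += 1
    return r, m


def estimate_entropy(stream, copies, rng):
    """Average `copies` independent basic estimators (assumed parameter)."""
    total = 0.0
    for _ in range(copies):
        r, m = one_pass_sample(stream, rng)
        total += basic_estimate(r, m)
    return total / copies
```

For example, `estimate_entropy(list("abracadabra"), 2000, random.Random(0))` should land near `exact_entropy(list("abracadabra"))`, with the error shrinking as `copies` grows.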

[1] Subhash Khot, et al. Near-optimal lower bounds on the multi-party communication complexity of set disjointness, 2003, 18th IEEE Annual Conference on Computational Complexity, 2003. Proceedings.

[2] Krzysztof Onak, et al. Sketching and Streaming Entropy via Approximation Theory, 2008, 2008 49th Annual IEEE Symposium on Foundations of Computer Science.

[3] Ronitt Rubinfeld, et al. The complexity of approximating the entropy, 2002, Proceedings 17th IEEE Annual Conference on Computational Complexity.

[4] Noga Alon, et al. The space complexity of approximating the frequency moments, 1996, STOC '96.

[5] Zhi-Li Zhang, et al. Profiling internet backbone traffic: behavior models and applications, 2005, SIGCOMM '05.

[6] David P. Woodruff, et al. Optimal approximations of the frequency moments of data streams, 2005, STOC '05.

[7] Gurmeet Singh Manku, et al. Approximate counts and quantiles over sliding windows, 2004, PODS.

[8] Sumit Ganguly, et al. Simpler algorithm for estimating frequency moments of data streams, 2006, SODA '06.

[9] David P. Woodruff. Optimal space lower bounds for all frequency moments, 2004, SODA '04.

[10] Srikanta Tirthapura, et al. Distributed Streams Algorithms for Sliding Windows, 2002, SPAA '02.

[11] Vyas Sekar, et al. Data streaming algorithms for estimating entropy of network traffic, 2006, SIGMETRICS '06/Performance '06.

[12] Noam Nisan, et al. Pseudorandom generators for space-bounded computations, 1990, STOC '90.

[13] Sumit Ganguly, et al. Estimating Entropy over Data Streams, 2006, ESA.

[14] Jeffrey Scott Vitter, et al. Random sampling with a reservoir, 1985, TOMS.

[15] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[16] Prosenjit Bose, et al. Bounds for Frequency Estimation of Packet Streams, 2003, SIROCCO.

[17] Sudipto Guha, et al. Streaming and sublinear approximation of entropy and information distances, 2005, SODA '06.

[18] Jayadev Misra, et al. Finding Repeated Elements, 1982, Sci. Comput. Program.

[19] Bernhard Plattner, et al. Entropy based worm and anomaly detection in fast IP networks, 2005, 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise (WETICE'05).

[20] Ziv Bar-Yossef, et al. Reductions in streaming algorithms, with an application to counting triangles in graphs, 2002, SODA '02.

[21] Piotr Indyk, et al. A small approximately min-wise independent family of hash functions, 1999, SODA '99.

[22] Noam Nisan, et al. Pseudorandom generators for space-bounded computation, 1992, Comb.

[23] Piotr Indyk, et al. Maintaining Stream Statistics over Sliding Windows, 2002, SIAM J. Comput.

[24] Rajeev Motwani, et al. Sampling from a moving window over streaming data, 2002, SODA '02.

[25] Luca Trevisan, et al. Counting Distinct Elements in a Data Stream, 2002, RANDOM.

[26] Graham Cormode, et al. Space efficient mining of multigraph streams, 2005, PODS.

[27] S. Muthukrishnan, et al. Estimating Entropy and Entropy Norm on Data Streams, 2006, Internet Math.

[28] Donald F. Towsley, et al. Detecting anomalies in network traffic using maximum entropy estimation, 2005, IMC '05.

[29] Graham Cormode, et al. An improved data stream summary: the count-min sketch and its applications, 2004, J. Algorithms.

[30] David P. Woodruff, et al. Tight lower bounds for the distinct elements problem, 2003, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings.