Estimating Entropy and Entropy Norm on Data Streams

We consider the problem of computing information-theoretic functions, such as entropy, on a data stream, using sublinear space. Our first result deals with a measure we call the entropy norm of an input stream: it is closely related to entropy but is structurally similar to the well-studied notion of frequency moments. We give a polylogarithmic-space, one-pass algorithm for estimating this norm under certain conditions on the input stream. We also prove a lower bound that rules out such an algorithm if these conditions do not hold. Our second group of results is for estimating the empirical entropy of an input stream. We first present a sublinear-space, one-pass algorithm for this problem. For a stream of m items and a given real parameter α, our algorithm uses space Õ(m^{2α}) and provides an approximation of 1/α in the worst case and (1+ε) in "most" cases. We then present a two-pass, polylogarithmic-space, (1+ε)-approximation algorithm. All our algorithms are quite simple.
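To make the two quantities concrete, the following is a minimal sketch that computes them exactly, using space linear in the number of distinct items. It is a reference baseline for checking an estimator, not the sublinear-space algorithms summarized above. It assumes the entropy norm is defined as the sum of m_i lg m_i over item frequencies m_i, and the empirical entropy as the sum of (m_i/m) lg(m/m_i); the abstract does not state these definitions, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy_norm(stream):
    """Exact entropy norm: sum of m_i * lg(m_i) over item frequencies m_i.

    Assumed definition; not stated in the abstract itself.
    """
    counts = Counter(stream)
    return sum(c * math.log2(c) for c in counts.values())


def empirical_entropy(stream):
    """Exact empirical entropy: sum of (m_i / m) * lg(m / m_i)."""
    counts = Counter(stream)
    m = sum(counts.values())  # stream length
    return sum((c / m) * math.log2(m / c) for c in counts.values())


stream = ["a", "a", "b", "c"]
m = len(stream)

f_h = entropy_norm(stream)      # 2 * lg 2 + 0 + 0 = 2.0
h = empirical_entropy(stream)   # 0.5 * 1 + 0.25 * 2 + 0.25 * 2 = 1.5

# Under the assumed definitions the two are linked by H = lg(m) - F_H / m.
print(f_h, h, math.log2(m) - f_h / m)
```

Under these definitions the identity H = lg m - F_H/m holds for any nonempty stream, which is one way to read the claim that the entropy norm is closely related to entropy.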
