Sketching and Streaming Entropy via Approximation Theory

We give near-optimal sketching and streaming algorithms for estimating Shannon entropy in the most general streaming model, with arbitrary insertions and deletions. This improves on prior results, which obtain suboptimal space bounds in the general model, and near-optimal bounds only in the insertion-only model and without sketching. Our high-level approach is simple: we give algorithms to estimate Tsallis entropy, and use them to extrapolate an estimate of Shannon entropy. The accuracy of our estimates is proven using approximation theory arguments and extremal properties of Chebyshev polynomials. Our work also yields the best known, and near-optimal, additive approximations for entropy, and hence also for conditional entropy and mutual information.
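The extrapolation idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the Tsallis entropies T_q(p) = (1 - sum_i p_i^q) / (q - 1) are computed exactly from a known distribution rather than estimated from a stream sketch, and the interval [1.05, 1.5], the node count k = 4, and all function names are illustrative choices. It evaluates T_q at Chebyshev nodes, fits the interpolating polynomial in q, and evaluates that polynomial at q = 1, where T_q tends to the Shannon entropy (in nats).

```python
import math


def shannon_entropy(p):
    """Shannon entropy in nats of a probability vector p."""
    return -sum(x * math.log(x) for x in p if x > 0)


def tsallis_entropy(p, q):
    """Tsallis entropy T_q(p) = (1 - sum_i p_i^q) / (q - 1), for q != 1."""
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)


def chebyshev_nodes(a, b, k):
    """The k Chebyshev nodes mapped from [-1, 1] to the interval [a, b]."""
    mid, half = 0.5 * (a + b), 0.5 * (b - a)
    return [mid + half * math.cos((2 * j + 1) * math.pi / (2 * k))
            for j in range(k)]


def lagrange_eval(xs, ys, x):
    """Evaluate at x the polynomial interpolating the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += weight * yi
    return total


def extrapolated_shannon(p, k=4, lo=1.05, hi=1.5):
    """Estimate Shannon entropy by extrapolating T_q from [lo, hi] to q = 1.

    Assumption: T_q is computed exactly here; in a streaming setting each
    value would instead be an approximation obtained from a sketch.
    """
    qs = chebyshev_nodes(lo, hi, k)
    ts = [tsallis_entropy(p, q) for q in qs]
    return lagrange_eval(qs, ts, 1.0)


if __name__ == "__main__":
    p = [0.5, 0.25, 0.125, 0.125]
    print("exact Shannon entropy:", shannon_entropy(p))
    print("extrapolated estimate:", extrapolated_shannon(p))
```

With exact inputs the extrapolated value is close to the true entropy for this small example. When each T_q is only approximate, the extrapolation can amplify the errors, which is presumably why the choice of evaluation points and the extremal properties of Chebyshev polynomials matter for the accuracy analysis.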
