Approximating and testing k-histogram distributions in sub-linear time

A discrete distribution <i>p</i> over <i>[n]</i> is a <i>k</i>-histogram if its probability distribution function can be represented as a piecewise constant function with <i>k</i> pieces. Such a function is represented by a list of <i>k</i> intervals and <i>k</i> corresponding values. We consider the following problem: given a collection of samples from a distribution <i>p</i>, find a <i>k</i>-histogram that (approximately) minimizes the ℓ<sub>2</sub> distance to <i>p</i>. We give time- and sample-efficient algorithms for this problem. We further provide algorithms that distinguish distributions that are <i>k</i>-histograms from distributions that are ε-far from any <i>k</i>-histogram, in the ℓ<sub>1</sub> distance and the ℓ<sub>2</sub> distance respectively.
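To make the definition concrete, the following minimal Python sketch stores a <i>k</i>-histogram as a list of intervals with matching values and measures its ℓ<sub>2</sub> distance to the empirical distribution of a sample. The function names, the 1-based inclusive interval convention, and the toy data are illustrative assumptions. The sketch only evaluates the objective and is not the sub-linear algorithm of the paper.

```python
from math import sqrt

def histogram_to_pmf(intervals, values, n):
    """Expand a k-histogram over [n] = {1, ..., n} into a length-n list.

    intervals: k pairs (lo, hi), 1-based and inclusive (assumed convention).
    values:    k per-element probabilities, one per interval.
    """
    pmf = [0.0] * n
    for (lo, hi), v in zip(intervals, values):
        for i in range(lo, hi + 1):
            pmf[i - 1] = v
    return pmf

def empirical_pmf(samples, n):
    """Empirical distribution of samples drawn from [n]."""
    counts = [0] * n
    for s in samples:
        counts[s - 1] += 1
    return [c / len(samples) for c in counts]

def l2_distance(p, q):
    """Euclidean (l2) distance between two probability vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 2-histogram over [4]: 0.4 on {1, 2} and 0.1 on {3, 4}.
h = histogram_to_pmf([(1, 2), (3, 4)], [0.4, 0.1], 4)

# Hypothetical sample; its empirical distribution is compared against h.
samples = [1, 1, 2, 2, 3, 4, 1, 2, 1, 2]
p_hat = empirical_pmf(samples, 4)
dist = l2_distance(h, p_hat)
```

Note that a histogram's values must satisfy a normalization constraint: each value weighted by its interval length must sum to 1, as it does here (2 × 0.4 + 2 × 0.1 = 1).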
