Near-Optimal Closeness Testing of Discrete Histogram Distributions

We investigate the problem of testing the equivalence between two discrete histograms. A k-histogram over [n] is a probability distribution that is piecewise constant over some set of k intervals of [n]. Histograms have been extensively studied in computer science and statistics. Given a set of samples from two k-histogram distributions p, q over [n], we want to distinguish (with high probability) between the cases that p = q and ||p - q||_1 >= epsilon. The main contribution of this paper is a new algorithm for this testing problem and a nearly matching information-theoretic lower bound. Specifically, the sample complexity of our algorithm matches our lower bound up to a logarithmic factor, improving on previous work by polynomial factors in the relevant parameters. Our algorithmic approach applies in a more general setting and yields improved sample upper bounds for testing closeness of other structured distributions as well.
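
To make the definitions concrete, the following minimal Python sketch builds two k-histograms over a small domain, computes their L1 distance, and draws samples from one of them. It only illustrates the objects in the problem statement and is not the testing algorithm of the paper. The function names (`k_histogram`, `l1_distance`, `draw_samples`), the 0-based domain {0, ..., n-1}, and the encoding of intervals by their exclusive right endpoints are assumptions made for this example.

```python
import random


def k_histogram(n, right_endpoints, masses):
    """Return the pmf (list of length n) of a k-histogram.

    Hypothetical encoding: right_endpoints[i] is the exclusive right end of
    the i-th interval, and masses[i] is the total probability of that
    interval, spread uniformly over its points.
    """
    assert len(right_endpoints) == len(masses)
    assert right_endpoints[-1] == n
    assert abs(sum(masses) - 1.0) < 1e-9
    pmf = []
    start = 0
    for end, mass in zip(right_endpoints, masses):
        assert end > start, "intervals must be non-empty and increasing"
        width = end - start
        pmf.extend([mass / width] * width)  # constant value on the interval
        start = end
    return pmf


def l1_distance(p, q):
    """Return ||p - q||_1 for two pmfs over the same domain."""
    assert len(p) == len(q)
    return sum(abs(a - b) for a, b in zip(p, q))


def draw_samples(pmf, m, seed=0):
    """Draw m independent samples from pmf using a seeded generator."""
    rng = random.Random(seed)
    return rng.choices(range(len(pmf)), weights=pmf, k=m)


# p is a 3-histogram and q is a 2-histogram (here, uniform) over n = 8 points.
p = k_histogram(8, [2, 6, 8], [0.5, 0.25, 0.25])
q = k_histogram(8, [4, 8], [0.5, 0.5])

distance = l1_distance(p, q)
samples = draw_samples(p, 20)
print(distance)
```

In this instance the two distributions differ, so a tester run with any epsilon up to the printed distance would be expected to report the "far" case. A tester only sees samples such as `samples`, not the pmfs themselves.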
