The Optimal Approximation Factor in Density Estimation

Consider the following problem: given two arbitrary densities $q_1,q_2$ and sample access to an unknown target density $p$, determine which of the $q_i$ is closer to $p$ in total variation distance. A remarkable result due to Yatracos shows that this problem is tractable in the following sense: there exists an algorithm that uses $O(\epsilon^{-2})$ samples from $p$ and outputs a $q_i$ such that, with high probability, $TV(q_i,p) \leq 3\cdot\mathsf{opt} + \epsilon$, where $\mathsf{opt}= \min\{TV(q_1,p),TV(q_2,p)\}$. Moreover, this result extends to any finite class of densities $\mathcal{Q}$: there exists an algorithm that outputs the best density in $\mathcal{Q}$ up to a multiplicative approximation factor of 3. We complement and extend this result by showing that: (i) the factor 3 cannot be improved if the algorithm is restricted to output a density from $\mathcal{Q}$, and (ii) if the algorithm is allowed to output arbitrary densities (e.g.\ a mixture of densities from $\mathcal{Q}$), then the approximation factor can be reduced to 2, which is optimal. In particular, this demonstrates an advantage of improper learning over proper learning in this setup. We develop two approaches that achieve the optimal approximation factor of 2: an adaptive one and a static one. Both are based on a geometric view of the problem and rely on estimating surrogate metrics for the total variation distance. Our sample complexity bounds exploit techniques from {\it Adaptive Data Analysis}.
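The factor-3 guarantee for two candidates is achieved by the classical Scheffé test (see Devroye and Lugosi [10]), which compares the two hypotheses only on the single set where they disagree. Below is a minimal sketch for distributions over a finite domain; the discrete setting, the function name, and the toy distributions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def scheffe_test(q1, q2, samples):
    """Scheffe test for two candidate distributions over {0, ..., k-1}.

    q1, q2 are probability vectors; `samples` are i.i.d. draws from the
    unknown target p.  Returns 1 or 2.  With O(eps^-2) samples the winner
    q_i satisfies TV(q_i, p) <= 3*opt + eps with high probability.
    """
    q1, q2 = np.asarray(q1, float), np.asarray(q2, float)
    # Scheffe set A = {x : q1(x) > q2(x)}; it witnesses TV(q1, q2).
    A = q1 > q2
    # Mass each candidate puts on A, and the empirical mass of p on A.
    Q1A, Q2A = q1[A].sum(), q2[A].sum()
    pA_hat = np.mean(A[np.asarray(samples)])
    # Select the candidate whose mass on A is closer to the empirical one.
    return 1 if abs(Q1A - pA_hat) <= abs(Q2A - pA_hat) else 2

# Toy usage: q1 is TV-closer to p than q2, so the test should output 1.
rng = np.random.default_rng(0)
p  = np.array([0.5, 0.3, 0.2])
q1 = np.array([0.45, 0.35, 0.2])   # TV(q1, p) = 0.05
q2 = np.array([0.2, 0.2, 0.6])     # TV(q2, p) = 0.4
print(scheffe_test(q1, q2, rng.choice(3, size=2000, p=p)))  # prints 1 (w.h.p.)
```

Restricting attention to the single disagreement set $A$ is what keeps the sample cost at $O(\epsilon^{-2})$, and it is also where the factor 3 enters the standard analysis: the winner is only guaranteed to match $p$ on $A$, not on every event.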

[1] K. Pearson. Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material. 1895.

[2] J. von Neumann. Zur Theorie der Gesellschaftsspiele [On the Theory of Games of Strategy]. 1928.

[3] V. Vapnik and A. Chervonenkis. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. 1971.

[4] N. Sauer. On the Density of Families of Sets. J. Comb. Theory A, 1972.

[5] L. Devroye and L. Györfi. Nonparametric Density Estimation: The $L_1$ View. 1985.

[6] Y. Yatracos. Rates of Convergence of Minimum Distance Estimators and Kolmogorov's Entropy. 1985.

[7] L. Devroye and L. Györfi. Nonparametric Density Estimation: The $L_1$ View. 1987.

[8] G. Lugosi and A. Nobel. Consistency of Data-driven Histogram Methods for Density Estimation and Classification. 1996.

[9] A. Müller. Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability, 1997.

[10] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics, 2001.

[11] G. Lugosi et al. Bin Width Selection in Multivariate Histograms by the Combinatorial Method. 2004.

[12] D. Stefankovic et al. Density Estimation in Linear Time. COLT, 2007.

[13] A. T. Kalai et al. Disentangling Gaussians. Commun. ACM, 2012.

[14] R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. 2012.

[15] A. Daniely et al. Optimal Learners for Multiclass Problems. COLT, 2014.

[16] R. A. Servedio et al. Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms. NIPS, 2014.

[17] T. Pitassi et al. The Reusable Holdout: Preserving Validity in Adaptive Data Analysis. Science, 2015.

[18] N. Alon et al. Sign Rank versus VC Dimension. COLT, 2016.

[19] Y. Han et al. Minimax Estimation of the $L_1$ Distance. ISIT, 2016.

[20] R. Bassily et al. Algorithmic Stability for Adaptive Data Analysis. STOC, 2016.

[21] I. Diakonikolas. Learning Structured Distributions. Handbook of Big Data, 2016.

[22] D. M. Kane et al. Statistical Query Lower Bounds for Robust Estimation of High-Dimensional Gaussians and Gaussian Mixtures. FOCS, 2017.

[23] Y. Han et al. Minimax Estimation of the $L_1$ Distance. IEEE Trans. Inf. Theory, 2017.

[24] J. Li et al. Fast and Sample Near-Optimal Algorithms for Learning Multidimensional Histograms. COLT, 2018.

[25] S. Ben-David et al. Nearly Tight Sample Complexity Bounds for Learning Mixtures of Gaussians via Sample Compression Schemes. NeurIPS, 2018.

[26] D. M. Kane et al. List-Decodable Robust Mean Estimation and Learning Mixtures of Spherical Gaussians. STOC, 2018.

[27] P. Kothari et al. Robust Moment Estimation and Improved Clustering via Sum of Squares. STOC, 2018.

[28] S. Ben-David et al. Sample-Efficient Learning of Mixtures. AAAI, 2018.