Learning discrete distributions with infinite support

We present a novel approach to estimating discrete distributions with (potentially) infinite support in the total variation metric. Departing from the established paradigm, we make no structural assumptions whatsoever on the sampling distribution. In this setting, distribution-free risk bounds are impossible, and the best one can hope for is a fully empirical, data-dependent bound. We derive precisely such bounds and show that they are, in a well-defined sense, the best possible. Our main discovery is that the half-norm of the empirical distribution provides tight upper and lower estimates on the empirical risk. Moreover, this quantity decays at a nearly optimal rate as a function of the true distribution. The optimality follows from a minimax result, which may be of independent interest. We also provide additional structural results, including an exact Rademacher complexity calculation and what is, to our knowledge, the first connection between the total variation risk and the missing mass.
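
To make the central quantity concrete, the following is a minimal Python sketch, not the paper's procedure or its exact bounds: it assumes the relevant data-dependent functional is the square root of the half-norm of the empirical distribution, sum_i sqrt(phat(i)), scaled by 1/sqrt(n), and simply compares it against the realized total variation error for a geometric source (a distribution with infinite support). The constants in the paper's two-sided estimate are not reproduced here; the truncation level `tail` is a numerical convenience.

    import numpy as np

    rng = np.random.default_rng(0)
    q, tail = 0.5, 10_000   # Geometric(q) on {0, 1, 2, ...}; truncated at `tail` for numerics

    for n in (100, 1_000, 10_000):
        sample = rng.geometric(q, size=n) - 1             # NumPy's geometric starts at 1
        counts = np.bincount(sample, minlength=tail)[:tail]
        phat = counts / n                                 # empirical distribution
        p = q * (1.0 - q) ** np.arange(tail)              # true pmf
        tv = 0.5 * np.abs(phat - p).sum()                 # total variation distance
        half_norm_root = np.sqrt(phat).sum()              # sum_i sqrt(phat_i) = ||phat||_{1/2}^{1/2}
        print(f"n={n:>6}  TV={tv:.4f}  half-norm term={half_norm_root / np.sqrt(n):.4f}")

Under these assumptions, both printed columns shrink together at roughly the n^{-1/2} rate, illustrating how a fully empirical, per-distribution quantity can track the risk even when no structural assumptions are placed on the source.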
