Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications

An infinite urn scheme is defined by a probability mass function $(p_j)_{j\geq 1}$ over the positive integers. A random allocation consists of a sample of $N$ independent draws from this distribution, where $N$ may be deterministic or Poisson-distributed. This paper is concerned with occupancy counts, that is, the number of symbols with exactly $r$ or at least $r$ occurrences in the sample, and with the missing mass, that is, the total probability of all symbols that do not occur in the sample. Without any further assumption on the sampling distribution, these random quantities are shown to satisfy Bernstein-type concentration inequalities. The variance factors in these concentration inequalities are shown to be tight if the sampling distribution satisfies a regular variation property. This regular variation property reads as follows. Let the number of symbols with probability at least $x$ be $\vec{\nu}(x) = |\{ j \colon p_j \geq x\}|$. In a regularly varying urn scheme, $\vec{\nu}$ satisfies $\lim_{\tau\rightarrow 0} \vec{\nu}(\tau x)/\vec{\nu}(\tau) = x^{-\alpha}$ for some $\alpha \in [0,1]$, and the variance of the number of distinct symbols in a sample tends to infinity as the sample size tends to infinity. Among other applications, these concentration inequalities allow us to derive tight confidence intervals for the Good-Turing estimator of the missing mass.
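The quantities named in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the paper's method: it assumes a truncation of the infinite urn scheme to a finite alphabet, uses the power law $p_j \propto j^{-1/\alpha}$ as one example of a regularly varying distribution with index $\alpha \in (0,1)$, and takes the Good-Turing estimator of the missing mass to be the number of symbols seen exactly once divided by the sample size. The function names `power_law_pmf` and `occupancy_summary` are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def power_law_pmf(num_symbols, alpha):
    """Truncated pmf with p_j proportional to j**(-1/alpha), j = 1..num_symbols.

    Assumption: a finite truncation stands in for the infinite urn scheme.
    For this pmf, the number of symbols with p_j >= x grows like x**(-alpha).
    """
    weights = [j ** (-1.0 / alpha) for j in range(1, num_symbols + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def occupancy_summary(pmf, n, rng):
    """Draw n independent symbols from pmf and summarise the sample."""
    sample = rng.choices(range(len(pmf)), weights=pmf, k=n)
    counts = Counter(sample)                    # occurrences of each symbol
    count_of_counts = Counter(counts.values())  # r -> number of symbols seen exactly r times
    # Missing mass: total probability of the symbols absent from the sample.
    missing_mass = sum(p for j, p in enumerate(pmf) if j not in counts)
    # Good-Turing estimate: fraction of the sample made of singletons.
    good_turing = count_of_counts[1] / n
    return {
        "distinct": len(counts),
        "count_of_counts": count_of_counts,
        "missing_mass": missing_mass,
        "good_turing": good_turing,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    pmf = power_law_pmf(num_symbols=100_000, alpha=0.5)
    for n in (100, 1_000, 10_000):
        s = occupancy_summary(pmf, n, rng)
        print(n, s["distinct"], s["missing_mass"], s["good_turing"])
```

Repeating the simulation over many independent samples and comparing the spread of `missing_mass` with that of `good_turing` gives an empirical view of the fluctuations that the concentration inequalities are meant to control.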
