Always Good Turing: Asymptotically Optimal Probability Estimation

While deciphering the German Enigma code during World War II, I.J. Good and A.M. Turing considered the problem of estimating a probability distribution from a sample of data. They derived a surprising and unintuitive formula that has since been used in a variety of applications and studied by a number of researchers. Borrowing an information-theoretic and machine-learning framework, we define the attenuation of a probability estimator as the largest possible ratio between the per-symbol probability assigned to an arbitrarily long sequence by any distribution and the corresponding probability assigned by the estimator. We show that some common estimators have infinite attenuation and that the attenuation of the Good-Turing estimator is low, yet larger than one. We then derive an estimator whose attenuation is one; namely, as the length of any sequence increases, the per-symbol probability it assigns is asymptotically at least the highest possible. Interestingly, some of the proofs use celebrated results of Hardy and Ramanujan on the number of partitions of an integer. To better understand the behavior of the estimator, we study the probability it assigns to several simple sequences. We show that for some sequences this probability agrees with our intuition, while for others it is rather unexpected.
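
The Good-Turing formula at the heart of this discussion estimates the probability of a symbol that appeared t times in an n-symbol sample as roughly (t+1) N_{t+1} / (n N_t), where N_t is the number of distinct symbols appearing exactly t times; in particular, the total mass reserved for unseen symbols is N_1 / n. The Python sketch below illustrates this classical form; the fallback to the empirical estimate when N_{t+1} = 0 and the absence of renormalization are simplifying assumptions of this sketch, not features of the paper's estimator.

```python
from collections import Counter

def good_turing(sample):
    """Classical Good-Turing estimates for a sample (a sketch).

    Returns (p_unseen, per_symbol): p_unseen is the total probability
    mass assigned to symbols never seen in the sample; per_symbol maps
    each observed symbol to its estimated probability.
    """
    n = len(sample)
    counts = Counter(sample)                 # symbol -> frequency t
    freq_of_freq = Counter(counts.values())  # t -> N_t

    # Total mass for unseen symbols: N_1 / n.
    p_unseen = freq_of_freq.get(1, 0) / n

    per_symbol = {}
    for sym, t in counts.items():
        n_t = freq_of_freq[t]
        n_t1 = freq_of_freq.get(t + 1, 0)
        if n_t1 > 0:
            # Adjusted count t* = (t + 1) * N_{t+1} / N_t, probability t*/n.
            per_symbol[sym] = (t + 1) * n_t1 / (n_t * n)
        else:
            # N_{t+1} = 0: fall back to the empirical estimate t / n.
            # Practical versions smooth the N_t counts instead
            # (e.g. Gale's "simple Good-Turing").
            per_symbol[sym] = t / n
    return p_unseen, per_symbol

# Example: in "abracadabra", 'c' and 'd' are singletons (N_1 = 2),
# so 2/11 of the probability mass is reserved for unseen symbols.
p0, probs = good_turing("abracadabra")
print(p0, probs)
```

Note that the raw estimates need not sum to one; practical implementations renormalize over the observed symbols after setting aside the unseen mass.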
