QUADRATIC TSALLIS ENTROPY BIAS AND GENERALIZED MAXIMUM ENTROPY MODELS

In density estimation tasks, the Maximum Entropy (Maxent) model can effectively exploit reliable prior information via nonparametric constraints, that is, linear constraints without empirical parameters. However, reliable prior information is often insufficient, and parametric constraints become necessary, yet they pose considerable implementation complexity: improperly set parametric constraints can result in overfitting or underfitting. To alleviate this problem, a generalization of Maxent under the Tsallis entropy framework is proposed. The proposed method introduces a convex quadratic constraint that corrects the (expected) quadratic Tsallis Entropy Bias (TEB). Specifically, we demonstrate that the expected quadratic Tsallis entropy of the sampling distribution is smaller than that of the underlying true distribution under the frequentist, Bayesian prior, and Bayesian posterior frameworks, respectively. This expected entropy reduction is exactly the (expected) TEB; it admits a closed-form expression and acts as a consistent and unbiased correction with an appropriate convergence rate. TEB indicates that the entropy of a specific sampling distribution should be increased accordingly, which entails a quantitative reinterpretation of the Maxent principle. By compensating for TEB while forcing the resulting distribution to stay close to the sampling distribution, the proposed generalized quadratic Tsallis Entropy Bias Compensation (TEBC) Maxent can be expected to alleviate both overfitting and underfitting. We also present a connection between TEB and the Lidstone estimator; building on it, a TEB-Lidstone estimator is developed by analytically identifying the rate of probability correction in Lidstone smoothing. Extensive empirical evaluation shows promising performance of both TEBC Maxent and TEB-Lidstone in comparison with various state-of-the-art density estimation methods.
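As a minimal illustration of the frequentist case only (not the paper's implementation), the sketch below numerically checks the standard multinomial identity E[S2(p_hat)] = ((N - 1) / N) * S2(p), where S2(p) = 1 - sum_i p_i^2 is the quadratic Tsallis entropy; the shortfall S2(p) / N plays the role of the bias described above, and rescaling by N / (N - 1) gives an unbiased correction. The helper names tsallis_q2 and lidstone, the toy distribution p_true, the sample size N, and the free smoothing parameter lam are assumptions made for this example; in particular, lam stands in for the correction rate that the paper derives analytically for its TEB-Lidstone estimator.

    import numpy as np

    # Numerical sketch (illustrative, not the paper's code): for a multinomial
    # sample of size N drawn from a true distribution p, the empirical
    # distribution p_hat satisfies E[S2(p_hat)] = ((N - 1) / N) * S2(p), so the
    # expected quadratic Tsallis entropy of the sampling distribution falls
    # short of the true value by S2(p) / N, the frequentist form of the bias.

    def tsallis_q2(p):
        """Quadratic (q = 2) Tsallis entropy: S2(p) = 1 - sum_i p_i^2."""
        p = np.asarray(p, dtype=float)
        return 1.0 - np.sum(p ** 2)

    def lidstone(counts, lam):
        """Lidstone-smoothed estimate (n_i + lam) / (N + k * lam); here lam is a
        free parameter, whereas the paper's TEB-Lidstone fixes it analytically."""
        counts = np.asarray(counts, dtype=float)
        return (counts + lam) / (counts.sum() + lam * counts.size)

    rng = np.random.default_rng(0)
    p_true = np.array([0.5, 0.3, 0.15, 0.05])   # hypothetical true distribution
    N = 50                                      # sample size

    samples = [rng.multinomial(N, p_true) for _ in range(20000)]
    emp_entropies = [tsallis_q2(counts / N) for counts in samples]

    print("true S2(p)           :", tsallis_q2(p_true))
    print("mean S2(p_hat)       :", np.mean(emp_entropies))              # ~ (N-1)/N * S2(p)
    print("bias-compensated S2  :", np.mean(emp_entropies) * N / (N - 1))  # ~ S2(p)
    print("Lidstone (lam = 0.5) :", lidstone(samples[0], 0.5))

Running the sketch shows the mean empirical entropy sitting below the true value by roughly S2(p) / N, while the rescaled estimate recovers it, which is the behavior the abstract's TEB compensation is designed to exploit.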
