Entropy Rate Estimation for Markov Chains with Large State Space

Estimating the entropy of a distribution from data is one of the prototypical problems in distribution property testing and estimation. For estimating the Shannon entropy of a distribution on $S$ elements from independent samples, Paninski [14] showed that the sample complexity is sublinear in $S$, and Valiant and Valiant [29] showed that consistent estimation of Shannon entropy is possible if and only if the sample size $n$ far exceeds $\frac{S}{\log S}$. In this paper we consider the problem of estimating the entropy rate of a stationary reversible Markov chain with $S$ states from a sample path of $n$ observations. We show that: (1) as long as the Markov chain does not mix too slowly, i.e., the relaxation time is at most $O(\frac{S}{\ln^3 S})$, consistent estimation is achievable when $n \gg \frac{S^2}{\log S}$; (2) as long as the Markov chain has some slight dependency, i.e., the relaxation time is at least $1+\Omega(\frac{\ln^2 S}{\sqrt{S}})$, consistent estimation is impossible when $n \lesssim \frac{S^2}{\log S}$. Under both assumptions, the optimal estimation accuracy is shown to be $\Theta(\frac{S^2}{n \log S})$. In comparison, the empirical (plug-in) entropy rate requires $\Omega(S^2)$ samples to be consistent, even when the Markov chain is memoryless. In addition to synthetic experiments, we apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure.
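To make the plug-in baseline concrete: for a stationary chain with transition matrix $T$ and stationary distribution $\pi$, the entropy rate is $\bar{H} = \sum_i \pi_i \sum_j T_{ij} \log \frac{1}{T_{ij}}$, and the empirical entropy rate substitutes the transition frequencies observed along the sample path. The following is a minimal sketch in Python (the function name `empirical_entropy_rate` and its interface are ours, chosen for illustration); it implements only this plug-in baseline, not the bias-corrected estimators that attain the optimal sample complexity discussed above.

```python
import numpy as np


def empirical_entropy_rate(path, S):
    """Empirical (plug-in) entropy rate of a Markov chain, in nats.

    `path` is a sequence of states in {0, ..., S-1}. The estimator counts
    observed transitions, normalizes each row to obtain an empirical
    transition matrix, weights the rows by the empirical state frequencies,
    and plugs both into H = sum_i pi_i * sum_j T_ij * log(1 / T_ij).
    """
    counts = np.zeros((S, S))
    for a, b in zip(path[:-1], path[1:]):
        counts[a, b] += 1

    row_totals = counts.sum(axis=1)          # transitions observed out of each state
    pi_hat = row_totals / row_totals.sum()   # empirical state frequencies

    H = 0.0
    for i in range(S):
        if row_totals[i] == 0:               # state never observed as the source of a transition
            continue
        T_row = counts[i] / row_totals[i]    # empirical transition probabilities out of state i
        nz = T_row > 0
        H -= pi_hat[i] * (T_row[nz] * np.log(T_row[nz])).sum()
    return H
```

For a memoryless (i.i.d.) uniform source on $S$ symbols the true entropy rate is $\log S$, yet, as noted above, this plug-in estimate is consistent only when the path length is on the order of $S^2$, whereas the optimal estimators need only on the order of $\frac{S^2}{\log S}$ samples.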

[1] Liam Paninski, et al. Estimation of Entropy and Mutual Information, 2003, Neural Computation.

[2] James H. Martin, et al. Speech and Language Processing, 2nd Edition, 2008.

[3] Gabriela Ciuperca, et al. On the estimation of the entropy rate of finite Markov chains, 2005.

[4] Tsachy Weissman, et al. Minimax Redundancy for Markov Chains with Large State Space, 2018, 2018 IEEE International Symposium on Information Theory (ISIT).

[5] Yanjun Han, et al. Minimax Estimation of Functionals of Discrete Distributions, 2014, IEEE Transactions on Information Theory.

[6] Zoubin Ghahramani, et al. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, 2015, NIPS.

[7] John C. Kieffer, et al. Sample converses in source coding theory, 1991, IEEE Trans. Inf. Theory.

[8] Albert-László Barabási, et al. Limits of Predictability in Human Mobility, 2010, Science.

[9] D. Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods, 2012, arXiv:1212.2015.

[10] Alon Orlitsky, et al. A Unified Maximum Likelihood Approach for Estimating Symmetric Properties of Discrete Distributions, 2017, ICML.

[11] Michelle Effros, et al. Universal lossless source coding with the Burrows Wheeler transform, 1999, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[12] En-Hui Yang, et al. Estimating DNA sequence entropy, 2000, SODA '00.

[13] A. Antos, et al. Convergence properties of functional estimates for discrete distributions, 2001.

[14] Liam Paninski, et al. Estimating entropy on m bins given fewer than m samples, 2004, IEEE Transactions on Information Theory.

[15] Mitsuhiro Nakamura, et al. Predictability of conversation partners, 2011, arXiv.

[16] Bernardo A. Huberman, et al. How Random are Online Social Interactions?, 2012, Scientific Reports.

[17] Quoc V. Le, et al. Neural Architecture Search with Reinforcement Learning, 2016, ICLR.

[18] Claude E. Shannon, et al. Prediction and Entropy of Printed English, 1951.

[19] Alon Orlitsky, et al. Learning Markov distributions: Does estimation trump compression?, 2016, 2016 IEEE International Symposium on Information Theory (ISIT).

[20] Robert L. Mercer, et al. An Estimate of an Upper Bound for the Entropy of English, 1992, CL.

[21] Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[22] Thomas M. Cover, et al. A convergent gambling estimate of the entropy of English, 1978, IEEE Trans. Inf. Theory.

[23] Yanjun Han, et al. Optimal rates of entropy estimation over Lipschitz balls, 2017, The Annals of Statistics.

[24] Constantinos Daskalakis, et al. Testing Symmetric Markov Chains From a Single Trajectory, 2018, COLT.

[25] Paul Valiant, et al. Estimating the Unseen, 2013, NIPS.

[26] Yanjun Han, et al. Maximum Likelihood Estimation of Functionals of Discrete Distributions, 2014, IEEE Transactions on Information Theory.

[27] P. Billingsley, et al. Statistical Methods in Markov Chains, 1961.

[28] Alex Pentland, et al. The predictability of consumer visitation patterns, 2010, Scientific Reports.

[29] Gregory Valiant, et al. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs, 2011, STOC '11.

[30] Gábor Lugosi, et al. Prediction, learning, and games, 2006.

[31] Haim H. Permuter, et al. Universal Estimation of Directed Information, 2010, IEEE Transactions on Information Theory.

[32] Ravi Montenegro, et al. Mathematical Aspects of Mixing Times in Markov Chains, 2006, Found. Trends Theor. Comput. Sci.

[33] Yihong Wu, et al. Minimax Rates of Entropy Estimation on Large Alphabets via Best Polynomial Approximation, 2014, IEEE Transactions on Information Theory.

[34] Csaba Szepesvári, et al. Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path, 2015, NIPS.

[35] Sanjeev R. Kulkarni, et al. Universal entropy estimation via block sorting, 2004, IEEE Transactions on Information Theory.

[36] Daniel Jurafsky, et al. Data Noising as Smoothing in Neural Network Language Models, 2017, ICLR.

[37] Gregory Valiant, et al. The Power of Linear Estimators, 2011, 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.

[38] Eli Upfal, et al. Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.

[39] Yonghui Wu, et al. Exploring the Limits of Language Modeling, 2016, arXiv.

[40] Yuri M. Suhov, et al. Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text, 1998, IEEE Trans. Inf. Theory.

[41] Qian Jiang, et al. Construction of Transition Matrices of Reversible Markov Chains, 2009.

[42] Alexandre B. Tsybakov, et al. Introduction to Nonparametric Estimation, 2008, Springer Series in Statistics.

[43] P. Shields. The Ergodic Theory of Discrete Sample Paths, 1996.

[44] Y. Peres, et al. Estimating the Spectral Gap of a Reversible Markov Chain from a Short Trajectory, 2016, arXiv:1612.05330.

[45] Charles Bordenave, et al. Spectrum of large random reversible Markov chains: two examples, 2008, arXiv:0811.1097.

[46] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.

[47] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.

[48] Aaron D. Wyner, et al. Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression, 1989, IEEE Trans. Inf. Theory.

[49] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[50] Sergio Verdú, et al. Estimation of entropy rate and Rényi entropy rate for Markov chains, 2016, 2016 IEEE International Symposium on Information Theory (ISIT).