Upper Bound of Entropy Rate Revisited: A New Extrapolation of Compressed Large-Scale Corpora

This article presents entropy rate estimates for six human languages, obtained using large, state-of-the-art corpora of up to 7.8 gigabytes. To obtain estimates in the limit of infinite data length, we extrapolate the measured compression rates with a function given by an ansatz. Whereas previous work has proposed several ansatzes of this kind, we introduce a stretched exponential extrapolation function with a smaller fitting error. In this way, we uncover the possibility that the entropy rates of human languages are positive but roughly 20% smaller than previously reported.
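To make the extrapolation step concrete, the sketch below fits a stretched exponential ansatz of the form f(n) = h · exp(A n^(β−1)) with 0 < β < 1, so that f(n) → h as the text length n → ∞, and reads off h as the entropy rate estimate. This is a minimal illustration under stated assumptions: the functional form is one plausible reading of the stretched exponential ansatz named above, and the synthetic data, function names, and fitting setup are illustrative, not the paper's actual pipeline, which fits compression rates measured on prefixes of real corpora.

```python
# A hedged sketch of the extrapolation: fit a stretched exponential ansatz
# f(n) = h * exp(A * n**(beta - 1)) to per-prefix compression rates and
# read off h = lim_{n -> inf} f(n) as the entropy rate estimate.
# The data points below are synthetic stand-ins for measured rates.
import numpy as np
from scipy.optimize import curve_fit

def stretched_exp(n, h, A, beta):
    """Stretched exponential ansatz; tends to h as n -> infinity (beta < 1)."""
    return h * np.exp(A * n ** (beta - 1.0))

rng = np.random.default_rng(0)

# Prefix lengths in characters and synthetic compression rates in bits/char.
n = np.logspace(3, 9, 25)
rate = stretched_exp(n, h=1.2, A=3.0, beta=0.7) + rng.normal(0.0, 0.01, n.size)

# Bound 0 < beta <= 1 and h >= 0 so the fit converges to a finite,
# non-negative limit as n grows.
popt, _ = curve_fit(stretched_exp, n, rate,
                    p0=[1.0, 1.0, 0.8],
                    bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 1.0]))
h_hat, A_hat, beta_hat = popt
print(f"estimated entropy rate h = {h_hat:.3f} bits/char (beta = {beta_hat:.2f})")
```

Keeping β bounded below 1 is what makes the fitted curve converge to a finite positive limit h; with β = 1 the ansatz degenerates to a constant and the extrapolation carries no information beyond the observed rates.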
