Estimation of Entropy from Subword Complexity

Subword complexity is the function that maps each length n to the number of distinct substrings of length n contained in a given string. In this paper, two estimators of block entropy based on the subword complexity profile are proposed. The first estimator works well only for IID processes with uniform probabilities. The second estimator provides a lower bound on block entropy for any strictly stationary process whose block distributions are skewed towards less probable values. Using the second estimator, estimates of block entropy for natural language are obtained, confirming earlier hypotheses.
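To make the definition concrete, the following minimal Python sketch computes the subword complexity profile f(1), ..., f(max_len) of a string by exhaustive substring enumeration. It is not taken from the paper: the function name and the example string are illustrative, and the quantity log2 f(n)/n printed in the loop is only a naive proxy for block entropy, not either of the paper's two estimators.

from math import log2

def subword_complexity(text, max_len):
    # f(n) = number of distinct substrings of length n occurring in text
    return [len({text[i:i + n] for i in range(len(text) - n + 1)})
            for n in range(1, max_len + 1)]

if __name__ == "__main__":
    s = "abracadabra"
    for n, f in enumerate(subword_complexity(s, 5), start=1):
        # log2 f(n)/n is a naive stand-in for the block entropy rate;
        # the paper's estimators refine this idea (not reproduced here).
        print(f"f({n}) = {f},  log2 f({n})/{n} = {log2(f) / n:.3f}")

For an IID process with uniform probabilities over a k-letter alphabet, almost all k^n blocks of a small length n occur in a sufficiently long sample, so f(n) is close to k^n and log2 f(n)/n is close to log2 k, the true block entropy rate; this is the regime in which the first estimator is claimed to work well.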
