Entropy of natural languages: Theory and experiment

Abstract The concept of the entropy of natural languages, first introduced by Shannon [A mathematical theory of communication, Bell Syst. Tech. J. 27, 379–423 (1948)], and its significance are discussed. Various known approaches to, and results of, previous studies of language entropy are reviewed. A new, improved method for evaluating both lower and upper bounds on the entropy of printed texts is developed; it is a refinement of Shannon's prediction (guessing) method [Prediction and entropy of printed English, Bell Syst. Tech. J. 30, 54–64 (1951)]. The evaluation of the lower bound is shown to be a classical linear programming problem. A statistical analysis of the estimation of the bounds is given, and procedures for the statistical treatment of the experimental data (including verification of statistical validity and significance) are elaborated. The method has been applied to printed Hebrew texts in a large experiment (1000 independent samples) in order to evaluate the entropy and other information-theoretic characteristics of the Hebrew language. The results demonstrate the efficiency of the new method: the gap between the upper and lower bounds on the entropy is reduced by a factor of 2.25 compared with the original Shannon approach. A comparison with other languages is given, and possible applications of the method are briefly discussed.
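Since the method refines Shannon's guessing experiment, a short sketch of how the classical bounds are obtained from guess statistics may help orient the reader. The Python sketch below is illustrative only: the function name and the sample frequencies are hypothetical, and it implements Shannon's original (1951) closed-form bounds, not the refined linear-programming lower bound developed in this paper.

    import numpy as np

    def shannon_guess_bounds(q):
        """Classical Shannon (1951) bounds on per-symbol entropy, in bits.

        q[i] is the observed relative frequency with which a subject's
        (i+1)-th guess of the next symbol was correct.  The frequencies
        must be non-negative, sum to 1, and be non-increasing in rank
        for the lower bound to be valid.
        """
        q = np.asarray(q, dtype=float)
        n = q.size
        # Upper bound: the entropy of the guess-rank distribution itself.
        nz = q[q > 0.0]
        upper = -np.sum(nz * np.log2(nz))
        # Lower bound: sum_{i=1}^{n} i * (q_i - q_{i+1}) * log2(i),
        # with q_{n+1} taken to be 0.
        ranks = np.arange(1, n + 1)
        q_next = np.append(q[1:], 0.0)
        lower = np.sum(ranks * (q - q_next) * np.log2(ranks))
        return lower, upper

    # Hypothetical guess-rank frequencies, for illustration only:
    q = [0.55, 0.16, 0.09, 0.07, 0.05, 0.04, 0.03, 0.01]
    lo, hi = shannon_guess_bounds(q)
    print(f"{lo:.3f} <= H <= {hi:.3f} bits per symbol")

In Shannon's inequalities the upper bound is the entropy of the guess-rank distribution, while the lower bound weights each rank by the probability mass it sheds to the next rank; the linear-programming refinement described in the paper tightens the latter.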

[1] Gustav Herdan et al., The Advanced Theory of Language as Choice and Chance, 1968.

[2] W. Hilberg et al., Der bekannte Grenzwert der redundanzfreien Information in Texten - eine Fehlinterpretation der Shannonschen Experimente? [The well-known limit of redundancy-free information in texts - a misinterpretation of Shannon's experiments?], 1990.

[3] Nesa L'abbe Wu et al., Linear Programming and Extensions, 1981.

[4] E. Lehmann et al., Nonparametrics: Statistical Methods Based on Ranks, 1976.

[5] J. Licklider et al., Long-range constraints in the statistical structure of printed English, The American Journal of Psychology, 1955.

[6] Yehoshua Bar-Hillel et al., Language and Information: Selected Essays on Their Theory and Application, 1964.

[7] B. Partee et al., Mathematical Methods in Linguistics, 1990.

[8] Claude E. Shannon, Prediction and entropy of printed English, Bell Syst. Tech. J. 30, 54–64 (1951).

[9] Claude E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. 27, 379–423 (1948).

[10] Peter Grassberger et al., Estimating the information content of symbol sequences and efficient codes, IEEE Trans. Inf. Theory, 1989.

[11] G. A. Barnard et al., Statistical calculation of word entropies for four Western languages, IRE Trans. Inf. Theory, 1955.

[12] M. Wanas et al., First-, second- and third-order entropies of Arabic text, IEEE Trans. Inf. Theory, 1976.

[13] Hermann Witting et al., Mathematische Statistik II [Mathematical Statistics II], 1985.

[14] Werner Ebeling et al., Word frequency and entropy of symbolic sequences: a dynamical perspective, 1992.

[15] Thomas M. Cover et al., Probability and Information, 1986.

[16] Johannes G. A. van Mill et al., Transmission of Information, 1961.

[17] G. Boguslavskaja et al., Informational Estimates of Text, 1971.

[18] George B. Dantzig, Linear Programming and Extensions, 1965.

[19] Thomas M. Cover et al., A convergent gambling estimate of the entropy of English, IEEE Trans. Inf. Theory, 1978.

[20] G. Basharin, On a Statistical Estimate for the Entropy of a Sequence of Independent Random Variables, 1959.