Information content versus word length in random typing

Recently, it has been claimed that a linear relationship between a measure of information content and word length is expected from word length optimization, and it has been shown that this linearity is supported by a strong correlation between information content and word length in many languages (Piantadosi et al 2011 Proc. Nat. Acad. Sci. 108 3526). Here, we study in detail some connections between this measure and standard information theory. The relationship between the measure and word length is examined for the popular random typing process, in which a text is constructed by pressing keys at random on a keyboard containing letters and a space that acts as a word delimiter. Although this random process does not optimize word lengths according to information content, it exhibits a linear relationship between information content and word length. The exact slope and intercept are presented for three major variants of the random typing process. A strong correlation between information content and word length can arise simply from the units making up a word (e.g., letters), and not necessarily from the interplay between a word and its context as proposed by Piantadosi and co-workers. In itself, the linear relation does not entail that word lengths are the result of any optimization process.
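
As a rough illustration of how such linearity can follow from the make-up of words alone (this example is not part of the letter itself), the Python sketch below simulates the simplest variant of random typing: every keystroke is a space with probability p_space and otherwise one of N equiprobable letters, so a particular word of length ℓ is produced with probability ((1 - p_space)/N)^ℓ · p_space and its information content -log2 P grows linearly in ℓ with slope log2(N/(1 - p_space)). The parameter values, the minimum-count filter and the use of -log2 of the relative frequency as the information measure are illustrative assumptions, not choices taken from the paper.

```python
import math
import random
import string
from collections import Counter

# Illustrative parameters (assumptions, not values taken from the paper).
N_LETTERS = 3          # letters on the keyboard, all equiprobable
P_SPACE = 0.25         # probability of hitting the space bar
N_KEYSTROKES = 2_000_000
MIN_COUNT = 20         # keep only well-sampled word types
random.seed(0)

alphabet = string.ascii_lowercase[:N_LETTERS]

# Random typing: each keystroke is a space with probability P_SPACE,
# otherwise one of the N_LETTERS equiprobable letters.
keystrokes = [
    " " if random.random() < P_SPACE else random.choice(alphabet)
    for _ in range(N_KEYSTROKES)
]
words = "".join(keystrokes).split()  # split() drops empty words
total = len(words)

# Empirical information content of a word type: -log2 of its relative
# frequency. Keystrokes are independent, so a word carries no information
# about its context and an in-context measure reduces to this unigram value.
counts = Counter(words)
sampled = [(w, c) for w, c in counts.items() if c >= MIN_COUNT]
lengths = [len(w) for w, _ in sampled]
infos = [-math.log2(c / total) for _, c in sampled]

# Ordinary least-squares fit of information content against word length.
mean_len = sum(lengths) / len(sampled)
mean_info = sum(infos) / len(sampled)
slope = (
    sum((wl - mean_len) * (ic - mean_info) for wl, ic in zip(lengths, infos))
    / sum((wl - mean_len) ** 2 for wl in lengths)
)
intercept = mean_info - slope * mean_len

print(f"fitted slope:      {slope:.3f}")
print(f"theoretical slope: {math.log2(N_LETTERS / (1 - P_SPACE)):.3f}")
print(f"fitted intercept:  {intercept:.3f}")
```

Because str.split() discards empty words, the empirical word probabilities are implicitly conditioned on non-empty words; under the assumptions above this shifts the intercept but not the slope, so the fitted slope should come out close to the theoretical log2(N/(1 - p_space)) = 2 for the values chosen here.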

[1] M. Mitzenmacher et al., Power laws for monkeys typing randomly: the case of unequal probabilities, IEEE Transactions on Information Theory, 2004.

[2] J. R. Buck et al., The use of Zipf's law in animal communication analysis, Animal Behaviour, 2005.

[3] M. A. Montemurro et al., Long-range fractal correlations in literary corpora, arXiv, 2002.

[4] W. Ebeling et al., Entropy and long-range correlations in literary English, arXiv:cond-mat/0204108, 1993.

[5] G. Agoramoorthy et al., Efficiency of coding in macaque vocal communication, Biology Letters, 2010.

[6] R. Ferrer i Cancho, Zipf's law from a communicative phase transition, 2005.

[7] T. F. Jaeger et al., Redundancy and reduction: speakers manage syntactic information density, Cognitive Psychology, 2010.

[8] R. Ferrer-i-Cancho et al., Random texts do not exhibit the real Zipf's law-like rank distribution, PLoS ONE, 2010.

[9] D. Y. Manin et al., Zipf's law and avoidance of excessive synonymy, Cognitive Science, 2007.

[10] G. A. Miller et al., Finitary models of language users, 1963.

[11] R. Ferrer-i-Cancho, The frequency spectrum of finite samples from the intermittent silence process, 2009.

[12] G. Zipf et al., The Psycho-Biology of Language, 1936.

[13] S. T. Piantadosi et al., Word lengths are optimized for efficient communication, Proceedings of the National Academy of Sciences, 2011.

[14] F. Moscoso del Prado, The universal "shape" of human languages: spectral analysis beyond speech, 2011.

[15] D. Polani et al., Phase transitions in least-effort communications, 2010.

[16] W. Li et al., Random texts exhibit Zipf's-law-like word frequency distribution, IEEE Transactions on Information Theory, 1992.

[17] D. Lusseau et al., Efficient coding in dolphin surface behavioral patterns, 2009.