Word lengths are optimized for efficient communication

We demonstrate a substantial improvement on one of the most celebrated empirical laws in the study of language, Zipf's 75-year-old theory that word length is primarily determined by frequency of use. In accord with rational theories of communication, we show across 10 languages that a word's average information content in context is a much better predictor of its length than its frequency. This indicates that human lexicons are efficiently structured for communication by taking into account interword statistical dependencies. Lexical systems result from an optimization of communicative pressures, coding meanings efficiently given the complex statistics of natural language use.
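
To make the contrast between the two predictors concrete, the sketch below computes both on a toy token list: a word's negative log frequency (the Zipfian predictor) and its average information content, taken here as the mean of -log2 P(word | context) over all of the word's occurrences. It is a minimal illustration under stated assumptions, not the estimation procedure behind the reported results: the context is assumed to be just the preceding word, probabilities are unsmoothed maximum-likelihood estimates, and the function name `word_statistics` and the example sentence are hypothetical.

```python
import math
from collections import Counter, defaultdict


def word_statistics(tokens):
    """Return (unigram_info, avg_context_info), both in bits per word.

    Assumption: the context of a word is the single preceding token, and
    probabilities are raw relative frequencies with no smoothing.
    """
    total = len(tokens)
    unigram_counts = Counter(tokens)
    # Frequency-based predictor: -log2 of the word's relative frequency.
    unigram_info = {w: -math.log2(c / total) for w, c in unigram_counts.items()}

    # Counts of each context (preceding word) and each (context, word) pair.
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))

    info_sums = defaultdict(float)
    occurrences = Counter()
    for (prev, word), n in bigram_counts.items():
        p = n / context_counts[prev]           # P(word | prev)
        info_sums[word] += n * -math.log2(p)   # n occurrences share this surprisal
        occurrences[word] += n

    # Average information content: mean in-context surprisal per occurrence.
    # The first token has no preceding context, so that occurrence is skipped.
    avg_context_info = {w: info_sums[w] / occurrences[w] for w in info_sums}
    return unigram_info, avg_context_info


if __name__ == "__main__":
    toy_tokens = "the dog saw the cat and the dog ran".split()
    unigram_info, avg_context_info = word_statistics(toy_tokens)
    for word in sorted(avg_context_info):
        print(word, len(word),
              round(unigram_info[word], 3),
              round(avg_context_info[word], 3))
```

On real data, each of the two measures would be correlated with word length across a large vocabulary; a corpus this small only shows how the quantities are computed and that they can diverge for the same word.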