The statistical trade-off between word order and word structure – Large-scale evidence for the principle of least effort

Languages employ different strategies to transmit structural and grammatical information. In languages such as Mandarin Chinese or Vietnamese, for example, grammatical dependency relationships in a sentence are conveyed mainly by the ordering of words, whereas word order is much less restricted in languages such as Inupiatun or Quechua, which (also) use the internal structure of words (e.g. inflectional morphology) to mark grammatical relationships. Based on a quantitative analysis of more than 1,500 unique translations of different books of the Bible in almost 1,200 different languages, spoken as a native language by approximately 6 billion people (more than 80% of the world population), we present large-scale evidence for a statistical trade-off between the amount of information conveyed by the ordering of words and the amount of information conveyed by internal word structure: languages that rely more strongly on word order information tend to rely less on word structure information, and vice versa. Put differently, if less information is carried within the word, more information has to be spread among words in order to communicate successfully. In addition, we find that, despite differences in the way information is expressed, there is also evidence for a trade-off between different books of the biblical canon that recurs with little variation across languages: the more informative the word order of a book, the less informative its word structure, and vice versa. We argue that this might suggest that languages encode information in very different (but efficient) ways, while content-related and stylistic features are statistically encoded in very similar ways.
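To make the two quantities in the trade-off more concrete, the following is a minimal sketch of one way they could be approximated. It is an illustration under stated assumptions and not necessarily the estimator used in the study: it uses zlib-compressed size as a crude entropy proxy, destroys word order by shuffling word tokens, and destroys word structure by replacing each word type with a random string of the same length. All function names are hypothetical.

```python
import random
import zlib


def bits_per_byte(text: str) -> float:
    """Crude entropy proxy (assumption): compressed size in bits per input byte."""
    data = text.encode("utf-8")
    return 8 * len(zlib.compress(data, 9)) / len(data)


def shuffle_word_order(text: str, seed: int = 0) -> str:
    """Destroy word order: permute the word tokens, keep the words themselves."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def mask_word_structure(text: str, seed: int = 0) -> str:
    """Destroy word-internal structure: map each word type to a random
    string of the same length, keeping word order and word lengths."""
    rng = random.Random(seed)
    words = text.split()
    alphabet = sorted(set("".join(words)))
    mapping = {}
    used = set()
    for word in words:
        if word not in mapping:
            candidate = "".join(rng.choice(alphabet) for _ in word)
            # Redraw on collision so distinct types stay distinct.
            while candidate in used:
                candidate = "".join(rng.choice(alphabet) for _ in word)
            used.add(candidate)
            mapping[word] = candidate
    return " ".join(mapping[word] for word in words)


def information_estimates(text: str) -> dict:
    """Increase in compressed size per byte when order or structure is destroyed."""
    base = bits_per_byte(text)
    return {
        "word_order": bits_per_byte(shuffle_word_order(text)) - base,
        "word_structure": bits_per_byte(mask_word_structure(text)) - base,
    }
```

Under these assumptions, a larger `word_order` value means that more of the text's regularity sits in the sequence of words, and a larger `word_structure` value means that more of it sits inside the words. Comparing the two values across translations of the same text is the kind of comparison the trade-off describes. A general-purpose compressor on short texts gives only a rough approximation.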
