Universals of word order reflect optimization of grammars for efficient communication

Significance

Human languages share many grammatical properties. We show that some of these properties can be explained by the need for languages to support efficient communication between humans, given our cognitive constraints. Grammars appear to strike a balance between two communicative pressures: they must be simple enough for the speaker to produce sentences easily, yet expressive enough to remain unambiguous to the hearer. This balance explains well-known word-order generalizations across our sample of 51 varied languages. Our results offer quantitative and computational evidence that language structure is dynamically shaped by communicative and cognitive pressures.

Abstract

The universal properties of human languages have been the subject of intense study across the language sciences. We report computational and corpus evidence for the hypothesis that a prominent subset of these universal properties, those related to word order, result from a process of optimization for efficient communication among humans, trading off the need to reduce complexity against the need to reduce ambiguity. We formalize these two pressures with information-theoretic and neural-network models of complexity and ambiguity, and simulate grammars with optimized word-order parameters on large-scale data from 51 languages. Evolving grammars toward efficiency yields word-order patterns that predict a large subset of the major word-order correlations across languages.
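The word-order correlations at issue can be illustrated with a deliberately minimal sketch (not the paper's actual model, which uses information-theoretic and neural-network objectives over corpus data). Here a "grammar" is just two binary word-order parameters, and the cost being optimized is total dependency length over a toy three-word tree V -> O -> M (a verb, its object noun, and a modifier of that noun). The harmonic grammars, in which the head is consistently initial or consistently final, minimize the cost, mirroring the Greenbergian correlation between verb-object and noun-modifier order:

```python
from itertools import product

# Toy sketch, for illustration only: two binary word-order parameters.
#   vo = True -> verb precedes its object (VO), else OV
#   nm = True -> noun precedes its modifier (N-M), else M-N
# Cost = total dependency length over the tree V -> O -> M.

def linearize(vo, nm):
    """Order the words of the toy sentence under the given grammar."""
    obj_phrase = ["O", "M"] if nm else ["M", "O"]
    return ["V"] + obj_phrase if vo else obj_phrase + ["V"]

def dependency_length(words):
    """Sum of linear distances between each head and its dependent."""
    pos = {w: i for i, w in enumerate(words)}
    arcs = [("V", "O"), ("O", "M")]  # (head, dependent)
    return sum(abs(pos[h] - pos[d]) for h, d in arcs)

costs = {(vo, nm): dependency_length(linearize(vo, nm))
         for vo, nm in product([True, False], repeat=2)}
harmonic = {g for g, c in costs.items() if c == min(costs.values())}
print(costs)
print(harmonic)  # the consistently head-initial and head-final grammars
```

Running the sketch shows the harmonic grammars (VO with N-M, and OV with M-N) tie at cost 2, while the disharmonic grammars cost 3. The paper's actual simulations optimize far richer parameterizations against complexity and ambiguity objectives, but the sketch captures why consistent head direction is efficient.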
