Population Size Predicts Lexical Diversity, but so Does the Mean Sea Level – Why It Is Important to Correctly Account for the Structure of Temporal Data

To demonstrate why it is important to correctly account for the serially dependent structure of temporal data, we document an apparently spectacular relationship between population size and lexical diversity: for five of the seven languages investigated, the population size of a country is strongly related to the lexical diversity of its primary language. We then show that this relationship is an artifact of a misspecified model that ignores the temporal structure of the data, by presenting a similar but nonsensical relationship between the global annual mean sea level and lexical diversity. Several recent studies report surprising links between economic, cultural, political, and (socio-)demographic variables on the one hand and cultural or linguistic characteristics on the other, yet seem to suffer from exactly this problem. We therefore explain the cause of the misspecification and show that it has profound consequences. We demonstrate that a simple transformation of the time series can often solve problems of this type, and we argue that evaluating the plausibility of a relationship is important in this context. We hope that our paper will help both researchers and reviewers understand why data with a natural temporal ordering call for models designed for such data.
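The problem described above can be illustrated with a minimal simulation. The sketch below is not the analysis from the paper: it uses synthetic data, and it assumes that the "simple transformation" is first differencing, a common choice for trending series. The names `population` and `sea_level` are hypothetical stand-ins for two series that are generated independently of each other but both drift upward over time.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def drifting_walk(rng, n, drift):
    """Random walk with drift: each value is the previous one plus drift plus noise."""
    level = 0.0
    series = []
    for _ in range(n):
        level += drift + rng.gauss(0.0, 1.0)
        series.append(level)
    return series


def first_differences(series):
    """Period-to-period changes: x[t] - x[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(42)  # fixed seed so the run is reproducible

# Two synthetic series with no causal link (hypothetical stand-ins).
population = drifting_walk(rng, 200, drift=1.0)
sea_level = drifting_walk(rng, 200, drift=0.5)

# Correlating the raw levels mostly picks up the shared upward trend.
r_levels = pearson(population, sea_level)

# Correlating the changes removes the trend and compares only the noise.
r_changes = pearson(first_differences(population), first_differences(sea_level))

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

In this setup the levels correlate strongly only because both series trend upward, while the period-to-period changes are essentially uncorrelated. Differencing is one option among several; whether it is appropriate for a given data set depends on the structure of the series.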
