The impact of lacking metadata and data truncation for the measurement of cultural and linguistic change using the Google Ngram datasets

Because of legal restrictions, the Google Ngram Corpora datasets (a) are not accompanied by any metadata about the texts that make up the corpora and (b) are truncated so that the author of a text cannot be inferred indirectly from its n-grams. This article discusses some of the consequences of this strategy.
