Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution

It is tempting to treat frequency trends from the Google Books data sets as indicators of the “true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set, but one to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical of academic articles but less common in general, such as references to time in the form of citations. We use information-theoretic methods to highlight these dynamics: we compute a divergence measure between decades of the English data sets over the period 1800–2000, then examine and compare the words that contribute most to it. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
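
The decade-to-decade comparison described above can be illustrated with a minimal sketch. The abstract does not name the divergence measure, so the code below assumes the Jensen-Shannon divergence (base 2) between two word-frequency distributions and breaks it into per-word contributions. The function names and the toy counts are hypothetical and are not taken from the paper or the corpus.

```python
import math


def normalize(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def jsd_contributions(counts_a, counts_b):
    """Per-word contributions to the Jensen-Shannon divergence (in bits).

    Assumption: JSD is the divergence of interest; the abstract only
    says "a divergence measure".
    """
    p = normalize(counts_a)
    q = normalize(counts_b)
    contributions = {}
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture of the two distributions
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / mw)
        contributions[word] = term
    return contributions


def jsd(counts_a, counts_b):
    """Total Jensen-Shannon divergence: the sum of per-word contributions."""
    return sum(jsd_contributions(counts_a, counts_b).values())


if __name__ == "__main__":
    # Hypothetical toy counts standing in for two decades of 1-grams.
    decade_early = {"the": 900, "heart": 60, "carriage": 40}
    decade_late = {"the": 880, "heart": 30, "figure": 50, "et": 40}

    contrib = jsd_contributions(decade_early, decade_late)
    ranked = sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)
    print(f"JSD = {jsd(decade_early, decade_late):.4f} bits")
    for word, value in ranked:
        print(f"{word:10s} {value:.5f}")
```

Ranking words by their contribution is one way to see which terms drive the divergence between two decades, for instance whether citation-like or scientific vocabulary dominates the top of the list.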
