Quantitative Analysis of Culture Using Millions of Digitized Books

Linguistic and cultural changes are revealed through analysis of the words appearing in books. We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
