Beyond Word Frequency: Bursts, Lulls, and Scaling in the Temporal Distributions of Words

Background: Zipf's discovery that word frequency distributions obey a power law established parallels between biological and physical processes, and language, laying the groundwork for a complex systems perspective on human communication. More recent research has also identified scaling regularities in the dynamics underlying the successive occurrences of events, suggesting the possibility of similar findings for language as well.

Methodology/Principal Findings: By considering frequent words in USENET discussion groups and in disparate databases where the language has different levels of formality, here we show that the distributions of distances between successive occurrences of the same word display bursty deviations from a Poisson process and are well characterized by a stretched exponential (Weibull) scaling. The extent of this deviation depends strongly on semantic type – a measure of the logicality of each word – and less strongly on frequency. We develop a generative model of this behavior that fully determines the dynamics of word usage.

Conclusions/Significance: Recurrence patterns of words are well described by a stretched exponential distribution of recurrence times, an empirical scaling that cannot be anticipated from Zipf's law. Because the use of words provides a uniquely precise and powerful lens on human thought and activity, our findings also have implications for other overt manifestations of collective human dynamics.
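As a rough illustration of the quantity the abstract describes, the sketch below measures the distances between successive occurrences of a word in a token sequence and estimates the exponent of a stretched exponential survival function, S(tau) = exp(-(tau / scale)^beta). The abstract does not say how the fit was done, so everything here is an assumption made for illustration: the function names, the least-squares fit on the linearised survival curve, and the synthetic test data are not the authors' procedure. Under this parametrisation, beta close to 1 corresponds to Poisson-like (exponential) recurrence and beta below 1 to bursty recurrence.

```python
import math
import random


def recurrence_distances(tokens, word):
    """Distances, in tokens, between successive occurrences of `word`."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def fit_stretched_exponential(distances):
    """Estimate (scale, beta) of S(tau) = exp(-(tau / scale) ** beta).

    Least squares on the linearised form
        ln(-ln S(tau)) = beta * ln(tau) - beta * ln(scale).
    A quick diagnostic (assumed here), not a maximum-likelihood fit.
    Needs at least two distinct distances, all >= 1.
    """
    n = len(distances)
    ordered = sorted(distances)
    xs, ys = [], []
    for i, tau in enumerate(ordered, start=1):
        # Keep one point per distinct distance: the last of each run of ties.
        if i < n and ordered[i] == tau:
            continue
        survival = 1.0 - (i - 0.5) / n  # empirical P(distance > tau), kept above 0
        xs.append(math.log(tau))
        ys.append(math.log(-math.log(survival)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    scale = math.exp(mean_x - mean_y / beta)
    return scale, beta


def synthetic_distances(n, scale, beta, rng):
    """Hypothetical test data: Weibull gaps rounded up to whole tokens."""
    return [max(1, math.ceil(rng.weibullvariate(scale, beta))) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    for true_beta in (1.0, 0.5):
        gaps = synthetic_distances(5000, 100.0, true_beta, rng)
        scale, beta = fit_stretched_exponential(gaps)
        print(f"true beta={true_beta}: estimated beta={beta:.2f}, scale={scale:.1f}")
```

On real text, `recurrence_distances(tokens, word)` would supply the input to `fit_stretched_exponential`; the synthetic generator exists only to check that the estimator recovers a known exponent.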
