Modeling Statistical Properties of Written Text

Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have been studied only independently of each other and mainly through descriptive models. As a consequence, linguistic processes are still poorly understood as complex emergent phenomena. Beyond Zipf's law for word frequencies, we focus here on burstiness, on Heaps' law, which describes the sublinear growth of vocabulary size with the length of a document, and on the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts, and we identify dynamic word ranking and memory across documents as key mechanisms explaining the nontrivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science, and linguistics.
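
As a point of orientation, Zipf's law is usually stated as the frequency of the word of rank r decaying roughly as a power of r, and Heaps' law as the number of distinct words growing roughly as a power of the text length with an exponent below one. The sketch below only measures the two empirical curves on a toy token list. It is not the generative model introduced here, and the function names `rank_frequency` and `vocabulary_growth` are hypothetical, chosen for illustration.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return word counts ordered from most to least frequent (rank 1 first)."""
    return [count for _, count in Counter(tokens).most_common()]


def vocabulary_growth(tokens):
    """Return the number of distinct words seen after each successive token."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


# Toy input (an assumption for illustration; a real study would use a corpus).
tokens = "the cat sat on the mat and the dog sat on the cat".split()

freqs = rank_frequency(tokens)      # Zipf-style curve: frequency versus rank
growth = vocabulary_growth(tokens)  # Heaps-style curve: vocabulary versus length
```

On a real corpus, plotting `freqs` against rank and `growth` against token position on logarithmic axes is a common first check for the two regularities.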
