Wikipedia Information Flow Analysis Reveals the Scale-Free Architecture of the Semantic Space

In this paper we extract the topology of the semantic space in its encyclopedic sense by measuring the semantic flow between the entries of the largest modern encyclopedia, Wikipedia, and thereby build a directed complex network of semantic flows. Notably, at the percolation threshold the semantic space is characterised by scale-free behaviour at different levels of complexity, which relates it to a wide range of biological, social and linguistic phenomena. In particular, we find that the cluster size distribution, representing the size of different semantic areas, is scale-free. Moreover, the topology of the resulting semantic space is scale-free in its connectivity distribution and displays small-world properties. However, its statistical properties do not allow a classical interpretation via a generative model based on a simple multiplicative process. After giving a detailed description and interpretation of the topological properties of the semantic space, we introduce a stochastic model of a content-based network, based on a copy-and-mutation algorithm and on Heaps' law, that is able to capture the main statistical properties of the analysed semantic space, including Zipf's law for the word frequency distribution.
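
For orientation, the scaling laws named in the abstract can be written in their standard textbook forms. The symbols below ($\gamma$, $\tau$, $\alpha$, $\beta$, $K$) are generic notation assumed here for illustration; the abstract itself reports no exponents or notation, so none are implied.

```latex
% Standard textbook forms; symbols are illustrative, not taken from the paper.
\begin{align}
  P(k) &\sim k^{-\gamma}
    && \text{scale-free connectivity (degree) distribution} \\
  n(s) &\sim s^{-\tau}
    && \text{scale-free cluster size distribution at the percolation threshold} \\
  f(r) &\propto r^{-\alpha}
    && \text{Zipf's law: frequency of the word of rank } r \\
  V(N) &= K N^{\beta}, \quad 0 < \beta < 1
    && \text{Heaps' law: vocabulary size after } N \text{ words}
\end{align}
```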
