Robust clustering of languages across Wikipedia growth

Wikipedia is the largest existing knowledge repository that grows through genuine crowdsourcing. While the English Wikipedia, with over 5 million articles, is the most extensive and the most researched edition, comparatively little is known about the behaviour and growth of the remaining 283 smaller Wikipedias, the smallest of which, Afar, has only one article. Here, we use a subset of these data consisting of 14 962 different articles, each of which exists in 26 different languages, from Arabic to Ukrainian. We study the growth of Wikipedias in these languages over a time span of 15 years. We show that, while an average article follows a random path from one language to another, there exist six well-defined clusters of Wikipedias that share common growth patterns. The make-up of these clusters is remarkably robust against the method used to determine them, as we verify with four different clustering methods. Interestingly, the identified Wikipedia clusters have little correlation with language families and groups. Rather, the growth of Wikipedia across different languages is governed by a variety of factors, ranging from similarities in culture to information literacy.
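One way to make the notion of robustness against the clustering method concrete is to compare the partitions returned by different methods pair by pair. The sketch below computes the Rand index, the fraction of language pairs on which two partitions agree. It is an illustration only: the abstract does not state which agreement measure was used, and the function name `rand_index`, the language codes, and the cluster assignments are hypothetical toy values, not results from the study.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of item pairs on which two partitions agree.

    Each argument maps an item (here a language code) to a cluster label.
    Two partitions agree on a pair if both put the pair in the same
    cluster, or both put it in different clusters.
    """
    if labels_a.keys() != labels_b.keys():
        raise ValueError("both partitions must cover the same items")
    pairs = list(combinations(sorted(labels_a), 2))
    if not pairs:
        return 1.0
    agree = 0
    for x, y in pairs:
        same_a = labels_a[x] == labels_a[y]
        same_b = labels_b[x] == labels_b[y]
        agree += same_a == same_b
    return agree / len(pairs)


# Hypothetical toy assignments for five languages (not data from the paper).
method_1 = {"ar": 0, "de": 1, "fr": 1, "ja": 2, "uk": 0}
# Same grouping as method_1, only the label names differ.
method_2 = {"ar": 5, "de": 7, "fr": 7, "ja": 9, "uk": 5}
# "ja" is merged into the cluster of "de" and "fr".
method_3 = {"ar": 0, "de": 1, "fr": 1, "ja": 1, "uk": 0}

print(rand_index(method_1, method_2))
print(rand_index(method_1, method_3))
```

A value of 1 means that two methods produce the same grouping regardless of how the clusters are labelled; values below 1 indicate pairs of languages that one method groups together and the other separates.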
