WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks

Wikipedia articles contain multiple links connecting a subject to other pages of the encyclopedia. In Wikipedia parlance, these links are called internal links or wikilinks. We present a complete dataset of the network of internal Wikipedia links for the 9 largest language editions. The dataset contains yearly snapshots of the network and spans 17 years, from the creation of Wikipedia in 2001 to March 1st, 2018. While previous work has mostly focused on the complete hyperlink graph, which also includes links automatically generated by templates, we parsed each revision of each article to track links appearing in the main text. In this way, we obtained a cleaner network, discarding more than half of the links and representing all and only the links intentionally added by editors. We describe in detail how the Wikipedia dumps have been processed and the challenges we have encountered, including the need to handle special pages such as redirects, i.e., alternative article titles. We present descriptive statistics of several snapshots of this network. Finally, we propose several research opportunities that can be explored using this new dataset.
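To illustrate the kind of processing described above, the following is a minimal Python sketch of extracting wikilinks from the main text of one revision while ignoring links that appear inside templates, and then resolving redirects. It is not the authors' pipeline: the function names (`strip_templates`, `extract_wikilinks`, `resolve_redirect`), the regular expressions, and the in-memory `redirects` dictionary are all assumptions made for illustration. The sketch also ignores namespaces (e.g. file or category links) and other wikitext subtleties.

```python
import re

# Assumed simplification: a wikilink is [[Target]], [[Target|label]] or
# [[Target#section]]; only the target title is captured.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Matches an innermost template call, i.e. {{...}} with no braces inside.
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")


def strip_templates(wikitext):
    """Remove {{...}} blocks, innermost first, so nested templates vanish too."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = TEMPLATE_RE.sub("", wikitext)
    return wikitext


def normalize_title(title):
    """Underscores become spaces and the first letter is upper-cased."""
    title = title.strip().replace("_", " ")
    return title[:1].upper() + title[1:]


def extract_wikilinks(wikitext):
    """Return link targets found in the main text, skipping template content."""
    text = strip_templates(wikitext)
    return [normalize_title(m.group(1)) for m in WIKILINK_RE.finditer(text)]


def resolve_redirect(title, redirects, max_hops=5):
    """Follow a hypothetical title -> target redirect map, guarding against cycles."""
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title


if __name__ == "__main__":
    sample = (
        "The [[pageRank|PageRank algorithm]] ranks pages. "
        "{{Infobox|link=[[Template link]]}} See [[Web_graph#History]]."
    )
    redirects = {"Web graph": "Webgraph"}  # hypothetical redirect table
    targets = [resolve_redirect(t, redirects) for t in extract_wikilinks(sample)]
    print(targets)
```

In this sketch the link inside the `{{Infobox ...}}` call is discarded because templates are stripped before links are matched, which mirrors the distinction the abstract draws between template-generated links and links written in the main text.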