A Vision for Performing Social and Economic Data Analysis using Wikipedia's Edit History

In this vision paper, we suggest combining two lines of research to study the collective behavior of Wikipedia contributors. The first line of research analyzes Wikipedia's edit history to quantify the quality of individual contributions and the resulting reputation of the contributor. The second line of research surveys Wikipedia contributors to gain insights into, e.g., their personal and professional background, socioeconomic status, or motives for contributing to Wikipedia. While both lines of research are valuable on their own, we argue that combining them could yield insights that exceed the sum of the individual parts. Linking survey data to contributor reputation and content-based quality metrics could provide a large-scale, public-domain dataset for user modeling, i.e., deducing interest profiles of user groups. User profiles can, among other applications, help to improve recommender systems. The resulting dataset can also enable a better understanding and improved prediction of high-quality Wikipedia content and successful Wikipedia contributors. Furthermore, the dataset can enable novel research approaches to investigate team composition and collective behavior, and can help to identify domain experts and young talent. We report on the status of implementing our large-scale, content-based analysis of the Wikipedia edit history using the big data processing framework Apache Flink. Additionally, we describe our plans to conduct a survey among Wikipedia contributors to enhance the content-based quality metrics.
