Cross-Language Learning from Bots and Users to Detect Vandalism on Wikipedia

Vandalism, the malicious modification of articles, is a serious problem for open-access encyclopedias such as Wikipedia. Counter-vandalism bots are changing the way Wikipedia identifies and bans vandals, but their contributions are often neither considered nor discussed. In this paper, we propose novel text features that capture the invariants of vandalism across five languages, and we use them to learn and compare the contributions of bots and users to the task of identifying vandalism. The features are computationally efficient, highlight the contributions of bots and users, and generalize across languages. We evaluate them through classification performance on revisions from five Wikipedia language editions, totaling over 500 million revisions of over nine million articles. As a comparison, we also evaluate the features on the small PAN Wikipedia vandalism data sets used by previous research, which contain approximately 62,000 revisions. We show differences in the performance of our features between the PAN data sets and the full Wikipedia data set. With appropriate text features, vandalism bots can be effective across different languages while learning from only one language. Our ultimate aim is to build the next generation of vandalism detection bots, based on machine learning approaches that work effectively across many languages.
