Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre in which a group of people share similar linguistic behavior. Experimental results suggest that (1) statistical models provide evidence of unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context-free grammars (PCFGs) discriminate vandalism more effectively than shallow lexico-syntactic patterns based on n-grams.
