Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso

User-generated content (UGC) constitutes a significant fraction of the Web. Some wiki-based sites, such as Wikipedia, are so popular that they have become a favorite target of spammers and other vandals. On such sites, human vigilance alone is not enough to combat vandalism, and tools that detect possible vandalism and poor-quality contributions become a necessity. Machine learning techniques hold promise for building efficient online algorithms that help users detect vandalism. We describe an efficient and accurate classifier that performs vandalism detection on UGC sites, and we report its results on the PAN Wikipedia dataset. We explore the effectiveness of a combination of 66 individual features that produces an AUC of 0.9553 on a test dataset, the best result to our knowledge. Using Lasso optimization, we then reduce this feature-rich model to a much smaller and more efficient model of 28 features that performs almost as well: the drop in AUC is only 0.005. We describe how this approach can be generalized to other user-generated content systems and present several applications of the classifier that help users identify potential vandalism.
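To illustrate how Lasso can shrink a feature set, the following is a minimal sketch, not the implementation used in this work. It assumes a squared-error objective with an L1 penalty, fitted by coordinate descent on synthetic data with no intercept term; the function names, the six toy features, and the penalty value `lam=0.1` are all hypothetical. Features whose weight is driven exactly to zero are dropped from the model.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero feature carries no information
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(v * (r + v * w[j]) for v, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, lam) / z
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - v * delta for r, v in zip(residual, col)]
                w[j] = new_w
    return w


# Synthetic data (an assumption for illustration): six features, of which
# only the first two actually influence the target.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print("weights:", [round(wj, 3) for wj in weights])
print("selected feature indices:", selected)
```

In a setting like the one described above, the surviving features would then be used to train the smaller model, and its AUC compared against the full model to measure what the reduction costs.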
