Wiki Vandalysis - Wikipedia Vandalism Analysis - Lab Report for PAN at CLEF 2010

Wikipedia describes itself as the “free encyclopedia that anyone can edit”. Alongside the volunteers who improve articles, a large number of malicious users abuse Wikipedia's open nature by vandalizing articles. Deterring and reverting vandalism has become one of Wikipedia's major challenges as the site grows. Wikipedia editors fight vandalism both manually and with automated bots that use regular expressions and other simple rules to recognize malicious edits [5]. Researchers have also proposed machine learning algorithms for vandalism detection [19,15], but these algorithms are still in their infancy and leave much room for improvement. This paper presents an approach to fighting vandalism that extracts various features from edits for machine learning classification. Our classifier uses information about the editor, the sentiment of the edit, the “quality” of the edit (i.e., spelling errors), and targeted regular expressions that capture patterns common in blatant vandalism, such as the insertion of obscene words or multiple exclamation marks. We achieve an area under the ROC curve (AUC) of 0.91 on a training set of 15,000 human-annotated edits and 0.887 on a random sample of 17,472 edits drawn from 317,443.
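As a rough illustration of the kind of features the abstract describes, the sketch below computes a few regular-expression and editor features for an edit, along with a pairwise AUC estimate. It is not the authors' implementation: the word list `OBSCENE_WORDS`, the function names `extract_features` and `auc`, and the choice of editor anonymity as the editor feature are all assumptions made for this example.

```python
import re

# Hypothetical word list for illustration only; the lexicon used in the
# paper is not given in this abstract.
OBSCENE_WORDS = {"stupid", "idiot", "sucks"}

# Two or more consecutive exclamation marks.
MULTI_EXCLAIM = re.compile(r"!{2,}")
# Simple word tokenizer (letters and apostrophes).
WORD = re.compile(r"[A-Za-z']+")


def extract_features(inserted_text, editor_is_anonymous):
    """Return a small feature dict for the text inserted by one edit.

    The feature names are assumptions chosen for this sketch.
    """
    words = [w.lower() for w in WORD.findall(inserted_text)]
    return {
        "anonymous_editor": int(editor_is_anonymous),
        "obscene_word_count": sum(w in OBSCENE_WORDS for w in words),
        "multi_exclamation_count": len(MULTI_EXCLAIM.findall(inserted_text)),
    }


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive example scores higher than a random negative one (ties count
    as one half). Quadratic in the number of examples, so only suitable
    for small samples."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(extract_features("This article SUCKS!!! you idiot!!", True))
    print(auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))
```

In practice such features would be fed to a trained classifier, and its predicted vandalism scores, rather than hand-picked numbers, would be passed to `auc`.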

[1] R. Alston. The English dictionary, 1966.

[2] John D. Lafferty, et al. A Robust Parsing Algorithm for Link Grammars, 1995, IWPT.

[3] Alberto Maria Segre, et al. Programs for Machine Learning, 1994.

[4] J. Ross Quinlan, et al. C4.5: Programs for Machine Learning, 1992.

[5] Ron Kohavi, et al. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid, 1996, KDD.

[6] Insup Lee, et al. Detecting Wikipedia vandalism via spatio-temporal analysis of revision metadata, 2010, EUROSEC '10.

[7] Yoram Singer, et al. An Efficient Boosting Algorithm for Combining Preferences, 2013.

[8] Ian H. Witten, et al. The WEKA data mining software: an update, 2009, SKDD.

[9] Martin Potthast, et al. Crowdsourcing a Wikipedia vandalism corpus, 2010, SIGIR.

[10] Benno Stein, et al. Automatic Vandalism Detection in Wikipedia, 2008, ECIR.

[11] John Riedl, et al. Creating, destroying, and restoring value in Wikipedia, 2007, GROUP.

[12] Charles L. A. Clarke, et al. Using dynamic Markov compression to detect vandalism in the Wikipedia, 2009, SIGIR.

[13] Bart Goethals, et al. Automatic Vandalism Detection in Wikipedia: Towards a Machine Learning Approach, 2008.