User Edits Classification Using Document Revision Histories

Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data.
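As a rough illustration of the kind of edit-distance baseline mentioned above, the sketch below labels a user edit as a fluency edit when the character-level Levenshtein distance between the text before and after the edit is small, and as a factual edit otherwise. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `levenshtein` and `classify_edit`, the label strings, and the threshold value of 4 are all hypothetical choices made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def classify_edit(before: str, after: str, threshold: int = 4) -> str:
    """Label an edit segment by thresholding its edit distance.

    The threshold is an assumed, illustrative value; in practice it
    would be tuned on labeled edits.
    """
    distance = levenshtein(before, after)
    return "fluency" if distance <= threshold else "factual"


if __name__ == "__main__":
    print(classify_edit("recieve", "receive"))
    print(classify_edit("born in 1950", "born in Paris in 1962"))
```

A small edit such as a spelling correction falls under the threshold, while an edit that inserts new content exceeds it. A baseline of this kind ignores what was changed, which is where features such as language model probabilities, part-of-speech tags, and named entities can add information.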
