A Comparison of Models over Wikipedia Articles Revisions

Measuring the similarity of texts is a common task in the detection of co-derivatives, plagiarism, and information flow. In general, the objective is to locate those fragments of a document that are derived from another text. We have carried out an exhaustive comparison of similarity estimation models in order to determine which one performs best at different levels of granularity and in different languages (English, German, Spanish, and Hindi). In connection with the comparison, we introduce a publicly available corpus specially suited to this task. Furthermore, we introduce some modifications to well-known algorithms in order to demonstrate their applicability to this task. Among other findings, our experiments show the strengths and weaknesses of the different models with respect to the granularity of the processed texts.
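The abstract does not list the models compared, so the following is only an illustrative sketch of what a similarity estimate between two text fragments can look like. It assumes a Jaccard coefficient over word n-grams, a common baseline for co-derivative detection, and is not taken from the paper. The function names `word_ngrams` and `jaccard_similarity`, the whitespace tokenization, and the default n = 3 are all assumptions made for this example.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(text_a, text_b, n=3):
    """Jaccard coefficient |A & B| / |A | B| over the two n-gram sets.

    Returns 0.0 when neither text is long enough to yield an n-gram.
    """
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    # Hypothetical example: a sentence and a lightly edited revision of it.
    original = "the quick brown fox jumps"
    revision = "the quick brown fox sleeps"
    print(jaccard_similarity(original, revision))
```

Granularity enters through the choice of unit passed to the function: the same measure can be applied to whole documents, sections, paragraphs, or sentences, and short units yield few n-grams, which makes the estimate less stable.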
