Semantic Document Distance Measures and Unsupervised Document Revision Detection

In this paper, we model the document revision detection problem as a minimum cost branching problem that relies on computing document distances. Furthermore, we propose two new document distance measures, word vector-based Dynamic Time Warping (wDTW) and word vector-based Tree Edit Distance (wTED). Our revision detection system is designed for large-scale corpora and implemented in Apache Spark. Using Wikipedia revision dumps and simulated data sets, we demonstrate that our system detects revisions more precisely than state-of-the-art methods.
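To give a feel for the kind of computation a word vector-based Dynamic Time Warping distance involves, the following is a minimal sketch of classic DTW applied to two sequences of vectors. It is not the paper's wDTW: the abstract does not specify the unit being aligned, the local distance, or any normalization, so the Euclidean local cost, the function names `euclidean` and `dtw_distance`, and the toy two-dimensional "word vectors" are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed local cost)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Classic dynamic time warping cost between two sequences of vectors.

    cost[i][j] holds the cheapest alignment cost of the first i items of
    seq_a with the first j items of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(
                cost[i - 1][j],      # seq_b[j-1] is matched to several items of seq_a
                cost[i][j - 1],      # seq_a[i-1] is matched to several items of seq_b
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]


# Hypothetical 2-d "word vectors", made up for this example only.
toy_vectors = {
    "cat": (1.0, 0.0),
    "dog": (0.9, 0.1),
    "runs": (0.0, 1.0),
}

original = [toy_vectors[w] for w in ("cat", "runs")]
revised = [toy_vectors[w] for w in ("dog", "runs")]
unrelated = [toy_vectors[w] for w in ("runs", "cat")]

# A revision that swaps in a nearby word stays closer than a reordering.
print(dtw_distance(original, revised) < dtw_distance(original, unrelated))
```

Because the local cost is computed on vectors rather than on exact token matches, replacing a word with a semantically close one changes the distance only slightly, which is the intuition behind using embeddings inside an alignment-based distance.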
