Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance

Automatic plagiarism detection with a reference corpus compares a suspicious text to a set of original documents in order to relate the plagiarised fragments to their potential sources. Publications on this task often assume that the search space (the set of reference documents) is small enough that any search strategy will produce good output in a short time. However, this is not always true. Reference corpora are often composed of a large set of original documents, for which a simple exhaustive search strategy becomes practically impossible. Before carrying out an exhaustive search, it is necessary to reduce the search space, represented by the documents in the reference corpus, as much as possible. Our experiments with the METER corpus show that a preliminary search space reduction stage, based on the symmetric Kullback-Leibler distance, dramatically reduces the search time. Additionally, it improves the precision and recall obtained by a search strategy based on the exhaustive comparison of word n-grams.
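As an illustration of the reduction stage described above, the following is a minimal Python sketch, not the authors' implementation. It ranks reference documents by the symmetric Kullback-Leibler distance, sum over terms of (P(w) - Q(w)) log(P(w) / Q(w)), and keeps only the k closest ones for the later exhaustive n-gram comparison. The whitespace tokenisation, the additive smoothing constant `alpha`, the use of the suspicious text's own vocabulary as the feature set, and the function names are all assumptions made for this example; the abstract does not specify them.

```python
import math
from collections import Counter


def term_distribution(text, vocabulary, alpha=0.01):
    """Smoothed term probabilities of `text` over `vocabulary`.

    Additive smoothing (alpha) is an assumption of this sketch; it keeps
    every probability non-zero so the logarithm below is defined.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler distance: KL(P||Q) + KL(Q||P)."""
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)


def closest_documents(suspicious, references, k=10):
    """Return the ids of the k reference documents closest to `suspicious`.

    `references` maps a document id to its text. Only the returned
    documents would be passed on to the exhaustive comparison stage.
    """
    vocabulary = set(suspicious.lower().split())
    p = term_distribution(suspicious, vocabulary)
    scored = []
    for doc_id, text in references.items():
        q = term_distribution(text, vocabulary)
        scored.append((symmetric_kl(p, q), doc_id))
    scored.sort()  # smallest distance first
    return [doc_id for _, doc_id in scored[:k]]
```

With a mapping such as `{"a": "...", "b": "..."}` as `references`, a call like `closest_documents(suspicious_text, references, k=10)` yields the reduced search space; the value of `k` is a tuning choice left open here.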
