Adaptive Algorithm for Plagiarism Detection: The Best-Performing Approach at PAN 2014 Text Alignment Competition

Monolingual text alignment is the task of finding similar text fragments between two given documents. It has applications in plagiarism detection, text reuse detection, author identification, authoring aids, and information retrieval, among others. We describe our approach to the text alignment subtask of the plagiarism detection competition at PAN 2014. The resulting system was the best-performing one at PAN 2014 and also outperforms the best-performing system of PAN 2013 on the cumulative evaluation measure Plagdet. Our method relies on a sentence similarity measure based on a tf-idf-like weighting scheme, which lets us take stopwords into account without increasing the rate of false positives. We introduce a recursive algorithm that extends the ranges of matching sentences to maximal-length passages, and a novel filtering method that resolves overlapping plagiarism cases. Our system is available as open source.
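
To illustrate how a tf-idf-like weighting can keep stopwords in play without letting them dominate, the following is a minimal sketch and not the system's actual implementation. It assumes that each sentence is treated as a "document" for the inverse-frequency term (so the weight is tf times the log of the number of sentences over the number of sentences containing the term) and that sentence pairs are compared by cosine similarity. The function names, the tokenizer, and the threshold of 0.3 are hypothetical choices made for the example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Lowercased word tokens; stopwords are deliberately NOT removed.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tf_isf_vectors(sentences):
    """Weight each term by tf * log(N / sf), where sf is the number of
    sentences containing the term (an assumed tf-idf-like scheme)."""
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)
    sf = Counter()
    for bag in bags:
        sf.update(bag.keys())
    return [{t: tf * math.log(n / sf[t]) for t, tf in bag.items()} for bag in bags]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def seed_pairs(suspicious, source, threshold=0.3):
    """Return (i, j) index pairs of sentences whose similarity reaches
    the threshold; the threshold value is an illustrative assumption."""
    vectors = tf_isf_vectors(suspicious + source)
    susp_vecs = vectors[:len(suspicious)]
    src_vecs = vectors[len(suspicious):]
    return [
        (i, j)
        for i, u in enumerate(susp_vecs)
        for j, v in enumerate(src_vecs)
        if cosine(u, v) >= threshold
    ]


if __name__ == "__main__":
    suspicious = ["The cat sat on the mat.", "Stock prices fell sharply today."]
    source = ["A cat sat on the mat quietly.", "Rain is expected over the weekend."]
    print(seed_pairs(suspicious, source))
```

In this sketch a word such as "the", which occurs in most sentences, receives a small weight, so two unrelated sentences that share only stopwords stay well below the threshold, while stopwords still contribute to the score of sentences that also share rarer terms.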
