Automatic Generation of Summary Obfuscation Corpus for Plagiarism Detection

In this paper, we describe an approach to creating a summary obfuscation corpus for the task of plagiarism detection. Our method is based on English-language data from the Document Understanding Conferences of 2001 and 2006. An unattributed summary used within someone else’s document is considered a kind of plagiarism because the original author’s ideas are still present, only in a more succinct form. To create the corpus, we use a Named Entity Recognizer (NER) to identify the entities within an original document, its associated summaries, and target documents. These entities, together with similar paragraphs in the target documents, are then used to build fake suspicious documents and plagiarized documents. The corpus was tested in a plagiarism detection competition.
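The pipeline outlined above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several parts are assumptions made only for illustration: a capitalized-word heuristic stands in for a real NER, Jaccard overlap of entity sets stands in for the paragraph similarity measure, and the function names `extract_entities`, `jaccard`, and `build_suspicious_document` are hypothetical. The sketch finds the target paragraph whose entities best match those of a summary and places the summary right after it to form a suspicious document.

```python
import re

# Assumption: a tiny stopword list to drop common sentence-initial words.
STOPWORDS = {"The", "An", "In", "It", "This", "These"}


def extract_entities(text):
    """Naive stand-in for a NER: capitalized words of two or more letters."""
    tokens = re.findall(r"\b[A-Z][a-zA-Z]+\b", text)
    return {tok for tok in tokens if tok not in STOPWORDS}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_suspicious_document(summary, target_paragraphs):
    """Insert the summary after the target paragraph with the most similar entities.

    Returns the new paragraph list and the index of the matched paragraph.
    """
    summary_entities = extract_entities(summary)
    scores = [
        jaccard(summary_entities, extract_entities(paragraph))
        for paragraph in target_paragraphs
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    suspicious = (
        target_paragraphs[: best + 1] + [summary] + target_paragraphs[best + 1 :]
    )
    return suspicious, best
```

In a real setting the heuristic would be replaced by an actual NER, and the matched position would be recorded as the annotation of the plagiarized span.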
