Paper Information - Using Hybrid Similarity Methods for Plagiarism Detection: Notebook for PAN at CLEF 2013


At PAN 2013 we focused entirely on the Text Alignment subtask. Building on our experience at PAN 2012 and CLINSS 2012, we combined the approaches we used in the previous year to meet the new challenges of PAN 2013. This year's competition added a new form of plagiarism obfuscation: text summarization. This kind of obfuscation represents a wide variety of typical cases of plagiarism in the wild and thus attracted our scientific interest. We set two main goals for this year's PAN: 1) to develop a unified approach that merges the results obtained by different analysis methods and then runs a single clustering algorithm to tackle the problem of granularity and produce clean clusters of suspected plagiarism; 2) to develop a new method of detecting summarization within the suspicious documents. As a starting point we used the prototype application we developed for PAN 2012 and another application developed for FIRE 2012 (CLINSS task). The two basic approaches are fingerprinting via 5-gram hashes with a variable step, our main method, and a sliding-window TF-IDF weighting score for detecting similarity in text pre-processed by a custom text summarizer. Euclidean distance based clustering with additional custom filters served as our cluster merging technique. During the training stage we used the data and performance measure scripts provided for PAN 2012 and PAN 2013, combined with a genetic algorithm for parameter tuning to optimize overall performance. Hardware used (training/development): a PC with a 6-core Intel i7990Ex, 6 GB RAM, and a Vertex3 SSD. Software used: Windows 7 x64, Visual Studio 2010, .NET Framework, C#, VB.NET. We obtained the 6th overall score at PAN 2013, with a final p-det of 0.6152.
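The fingerprinting and cluster merging steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' C#/VB.NET implementation: the tokenizer, the CRC32 hash, the greedy single-pass merge, and every function and parameter name (`fingerprint_index`, `find_matches`, `cluster_matches`, `step`, `max_dist`) are assumptions chosen for illustration, and the custom filters and genetic parameter tuning are omitted.

```python
import math
import re
import zlib
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def gram_hash(tokens, start, n):
    """Hash one word n-gram; CRC32 is an arbitrary stand-in hash."""
    return zlib.crc32(" ".join(tokens[start:start + n]).encode("utf-8"))


def fingerprint_index(tokens, n=5):
    """Map each n-gram hash of the source to the token offsets where it starts."""
    index = defaultdict(list)
    for j in range(len(tokens) - n + 1):
        index[gram_hash(tokens, j, n)].append(j)
    return index


def find_matches(susp_tokens, src_tokens, n=5, step=1):
    """Return (susp_offset, src_offset) pairs whose n-gram hashes agree.

    The suspicious document is sampled every `step` tokens, a simplified
    reading of "variable step".
    """
    index = fingerprint_index(src_tokens, n)
    matches = []
    for i in range(0, len(susp_tokens) - n + 1, step):
        for j in index.get(gram_hash(susp_tokens, i, n), ()):
            matches.append((i, j))
    return matches


def cluster_matches(matches, n=5, max_dist=8.0):
    """Greedily merge matches that lie within `max_dist` (Euclidean distance
    in the offset plane) of the previous match, then report token spans."""
    clusters = []
    for point in sorted(matches):
        if clusters:
            last = clusters[-1][-1]
            dist = math.hypot(point[0] - last[0], point[1] - last[1])
            if dist <= max_dist:
                clusters[-1].append(point)
                continue
        clusters.append([point])
    spans = []
    for cluster in clusters:
        susp = [i for i, _ in cluster]
        src = [j for _, j in cluster]
        # End offsets are exclusive: last n-gram start plus its length.
        spans.append(((min(susp), max(susp) + n), (min(src), max(src) + n)))
    return spans


def detect(susp_text, src_text, n=5, step=1, max_dist=8.0):
    """Run fingerprint matching and merging on one document pair."""
    matches = find_matches(tokenize(susp_text), tokenize(src_text), n, step)
    return cluster_matches(matches, n, max_dist)
```

Calling `detect(suspicious_text, source_text)` returns a list of `((susp_start, susp_end), (src_start, src_end))` token spans. A larger `max_dist` merges more fragments into one detection, which is the granularity trade-off the abstract refers to.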

Yurii Palkovskii | Alexei Belov
