Using Hybrid Similarity Methods for Plagiarism Detection
Notebook for PAN at CLEF 2013

At PAN 2013 we decided to focus entirely on the Text Alignment subtask. Building on our previous experience at PAN 2012 and CLINSS 2012, we combined the approaches we used in the previous year to face the new challenges of PAN 2013. This year's competition added a new form of plagiarism obfuscation: text summarization. This kind of obfuscation represents a wide variety of typical cases of plagiarism in the wild and thus attracted our scientific interest. For this year's PAN we set two main goals: 1) to develop a unified approach that merges the results obtained by different analysis methods and then runs a single clustering algorithm to tackle the problem of granularity and produce clean clusters of suspected plagiarism; 2) to develop a new method for detecting summarization within the suspicious documents. As a starting point for PAN 2013 we used the prototype application we developed for PAN 2012 and another application developed for FIRE 2012 (CLINSS task). Our two basic approaches are fingerprinting via 5-gram hashes with a variable step, which is our main method, and a sliding-window TF-IDF weighting score for similarity detection on text pre-processed by a custom text summarizer. Euclidean-distance-based clustering with additional custom filters was used as our cluster merging technique. During the training stage we used the data and performance measure scripts provided for PAN 2012 and PAN 2013, combined with a genetic algorithm for parameter tuning to optimize overall performance. Hardware used (training and development): PC with a 6-core Intel i7 990Ex, 6 GB RAM, and a Vertex 3 SSD. Software used: Windows 7 x64, Visual Studio 2010, .NET Framework, C#, VB.NET. We obtained the 6th overall score at PAN 2013 with a final p-det score of 0.6152.
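
To make the main pipeline more concrete, the following is a minimal Python sketch of fingerprinting with hashed 5-grams taken at a configurable step, followed by a distance-based grouping of the matching positions. It is an illustration under stated assumptions and not the implementation used in our system (which was written in C# and VB.NET): the use of word-level n-grams, MD5 as the hash, the greedy grouping rule, the `max_dist` threshold, and all function names are hypothetical choices made for this example. The custom filters and the TF-IDF summarization detector are not shown.

```python
import hashlib
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def fingerprints(tokens, n=5, step=1):
    """Map the hash of each word n-gram, taken every `step` tokens,
    to the list of token positions where that n-gram starts."""
    index = {}
    for i in range(0, len(tokens) - n + 1, step):
        gram = " ".join(tokens[i:i + n])
        # MD5 is an assumption; any stable hash function would do here.
        h = hashlib.md5(gram.encode("utf-8")).hexdigest()
        index.setdefault(h, []).append(i)
    return index


def matches(src_tokens, susp_tokens, n=5, src_step=1, susp_step=1):
    """Return sorted (source_pos, suspicious_pos) pairs that share an n-gram hash."""
    src = fingerprints(src_tokens, n, src_step)
    susp = fingerprints(susp_tokens, n, susp_step)
    pairs = [(i, j)
             for h, susp_positions in susp.items() if h in src
             for j in susp_positions
             for i in src[h]]
    return sorted(pairs)


def cluster(points, max_dist=10.0):
    """Greedily group match points: a point joins the first cluster whose
    most recently added point lies within `max_dist` (Euclidean distance),
    otherwise it starts a new cluster. A simplified stand-in for the
    cluster merging step; no additional filters are applied."""
    clusters = []
    for p in points:
        for c in clusters:
            if math.dist(p, c[-1]) <= max_dist:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters
```

A typical call would be `cluster(matches(tokenize(source_text), tokenize(suspicious_text)))`; each resulting cluster spans a candidate plagiarized passage in both documents. Increasing `src_step` shrinks the source index at the cost of missing some matches, while `max_dist` trades granularity against the risk of merging unrelated passages. In a setup like ours, such parameters are the kind of values a genetic algorithm could tune.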