External and Intrinsic Plagiarism Detection Using a Cross-Lingual Retrieval and Segmentation System - Lab Report for PAN at CLEF 2010

We present our hybrid system for the PAN challenge at CLEF 2010. Our system performs plagiarism detection for translated and non-translated exter- nally as well as intrinsically plagiarized document passages. Our external plagia- rism detection approach is formulated as an information retrieval problem, using heuristic post processing to arrive at the final detection results. For the retrieval step, source documents are split into overlapping blocks which are indexed via a Lucene instance. Suspicious documents are similarly split into consecutive over- lapping boolean queries which are performed on the Lucene index to retrieve an initial set of potentially plagiarized passages. For performance reasons queries might get rejected via a heuristic before actually being executed. Candidate hits gathered via the retrieval step are further post-processed by performing sequence analysis on the passages retrieved from the index with respect to the passages used for querying the index. By applying several merge heuristics bigger blocks are formed from matching sequences. German and Spanish source documents are first translated using word alignment on the Europarl corpus before enter- ing the above detection process. For each word in a translated document several translations are produced. Intrinsic plagiarism detection is done by finding major changes in style measured via word suffixes after the documents have been parti- tioned by an linear text segmentation algorithm. Our approach lead us to the third overall rank with an overall score of 0.6948.