Paper information - External and Intrinsic Plagiarism Detection Using a Cross-Lingual Retrieval and Segmentation System - Lab Report for PAN at CLEF 2010

We present our hybrid system for the PAN challenge at CLEF 2010. Our system performs plagiarism detection for translated and non-translated externally plagiarized document passages as well as for intrinsically plagiarized ones. Our external plagiarism detection approach is formulated as an information retrieval problem, using heuristic post-processing to arrive at the final detection results. For the retrieval step, source documents are split into overlapping blocks, which are indexed via a Lucene instance. Suspicious documents are similarly split into consecutive overlapping boolean queries, which are run against the Lucene index to retrieve an initial set of potentially plagiarized passages. For performance reasons, a heuristic may reject a query before it is actually executed. Candidate hits gathered in the retrieval step are further post-processed by performing sequence analysis on the passages retrieved from the index with respect to the passages used for querying the index. By applying several merge heuristics, bigger blocks are formed from matching sequences. German and Spanish source documents are first translated using word alignment on the Europarl corpus before entering the above detection process. For each word in a translated document, several translations are produced. Intrinsic plagiarism detection is done by finding major changes in style, measured via word suffixes, after the documents have been partitioned by a linear text segmentation algorithm. Our approach led us to the third overall rank with an overall score of 0.6948.
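The two generic building blocks named in the abstract, splitting a document into overlapping blocks and merging nearby matches into bigger blocks, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the block size, the step, and the `max_gap` threshold are all assumptions chosen for illustration, and the Lucene indexing, query rejection, and sequence analysis steps are not reproduced here.

```python
from typing import List, Tuple


def overlapping_blocks(tokens: List[str], size: int, step: int) -> List[Tuple[int, List[str]]]:
    """Split tokens into blocks of up to `size` tokens, starting every `step` tokens.

    Blocks overlap whenever step < size. Both values are illustrative
    assumptions; the abstract does not state the sizes actually used.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    blocks = []
    start = 0
    while start < len(tokens):
        blocks.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last block already reaches the end of the document
        start += step
    return blocks


def merge_spans(spans: List[Tuple[int, int]], max_gap: int) -> List[Tuple[int, int]]:
    """Merge (start, end) spans whose gap to the previous span is at most `max_gap`.

    A hypothetical stand-in for the merge heuristics mentioned in the
    abstract, which are not specified there.
    """
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    words = "a b c d e f g".split()
    for offset, block in overlapping_blocks(words, size=4, step=2):
        print(offset, block)
    print(merge_spans([(0, 4), (5, 9), (20, 25)], max_gap=2))
```

In a retrieval setting, each block from a source document would be indexed, each block from a suspicious document would become a query, and the token spans of the resulting hits would be passed to a merge step such as `merge_spans`.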
