A Hybrid Architecture for Plagiarism Detection

We present a hybrid plagiarism detection architecture that handles the two principal forms of text plagiarism. For order-preserving plagiarism, such as paraphrasing and modified cut-and-paste, it contains a text alignment component that is robust to changes in word choice and phrasing that do not alter the basic ordering. For non-order-based plagiarism, such as random phrase reordering and summarization, it contains a two-stage cluster detection component. The first stage identifies a maximal passage in the suspect document that is related to the source document; the second stage determines whether that passage corresponds to the entire source document or only to a passage within it. Three implementations of this architecture, sharing a common text alignment component and differing in their cluster detection components, participated in the PAN 2014 Text Alignment task, achieving very high precision, recall, and overall plagiarism detection scores.
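
The control flow described above can be illustrated with a minimal Python sketch. It is not the implementation that participated in PAN 2014: the function names, the thresholds, the choice to try alignment first and fall back to cluster detection, and the simple stand-ins used for each component (longest exact word run for alignment, sentence-level vocabulary overlap for cluster detection) are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def words(text):
    # Lowercased word tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9']+", text.lower())


def sentences(text):
    # Naive sentence splitter (assumption: sentences end with a period).
    return [s.strip() for s in text.split(".") if s.strip()]


def overlap(sentence, vocab):
    # Fraction of the sentence's distinct words that occur in vocab.
    ws = set(words(sentence))
    return len(ws & vocab) / len(ws) if ws else 0.0


def align(suspect, source, min_len=5):
    # Stand-in for the text alignment component: longest run of words
    # appearing in the same order in both texts (hypothetical min_len).
    a, b = words(suspect), words(source)
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return (m.a, m.a + m.size) if m.size >= min_len else None


def related_passage(suspect, source, threshold=0.5):
    # Stage 1 stand-in: longest run of consecutive suspect sentences
    # whose vocabulary overlaps the source, as a (start, end) index pair.
    vocab = set(words(source))
    best, start = None, None
    for i, s in enumerate(sentences(suspect) + [""]):  # "" closes a trailing run
        if overlap(s, vocab) >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def source_scope(passage, source, threshold=0.5, whole=0.8):
    # Stage 2 stand-in: does the suspect passage cover (nearly) all source
    # sentences, or only a contiguous part of the source?
    vocab = set(words(" ".join(passage)))
    src = sentences(source)
    hits = [i for i, s in enumerate(src) if overlap(s, vocab) >= threshold]
    if not hits:
        return None
    if len(hits) / len(src) >= whole:
        return ("document", 0, len(src))
    return ("passage", hits[0], hits[-1] + 1)


def detect(suspect, source):
    # Assumed dispatch: order-preserving check first, cluster detection second.
    span = align(suspect, source)
    if span is not None:
        return {"kind": "aligned", "suspect_words": span}
    run = related_passage(suspect, source)
    if run is None:
        return None
    scope = source_scope(sentences(suspect)[run[0]:run[1]], source)
    if scope is None:
        return None
    return {"kind": "cluster", "suspect_sentences": run, "source_scope": scope}
```

In this sketch, a suspect text that copies a run of source words in order is reported by `align`, while a text whose sentences reuse the source vocabulary in scrambled order falls through to the two cluster stages, which report either the whole source document or a sentence range within it.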