Plagiarism Alignment Detection by Merging Context Seeds
Notebook for PAN at CLEF 2014

We describe the algorithm we submitted to the text alignment sub-task of the plagiarism detection task in the PAN 2014 challenge, which achieved a plagdet score of 0.855. By extracting contextual features for each document character and grouping those that are relevant for a given pair of documents, we generate seeds of atomic plagiarism cases. These seeds are then merged by an agglomerative single-linkage strategy using a defined distance measure.
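
To make the merging step concrete, the following is a minimal sketch of single-linkage merging of seeds, not the submitted implementation. The abstract does not specify the distance measure, so the sketch assumes a hypothetical one: the larger of the character gaps between two seeds in the source and in the suspicious document. The names `Seed`, `interval_gap`, `seed_distance`, `merge_seeds`, and the threshold `max_gap` are illustrative assumptions. Cutting a single-linkage dendrogram at a threshold yields the connected components of the graph that links seeds whose distance is at most that threshold, so the sketch computes those components with a union-find structure.

```python
from collections import defaultdict
from typing import NamedTuple


class Seed(NamedTuple):
    """An atomic match: half-open character spans in both documents (assumed layout)."""
    src_start: int
    src_end: int
    susp_start: int
    susp_end: int


def interval_gap(a_start, a_end, b_start, b_end):
    """Number of characters separating two half-open intervals (0 if they overlap or touch)."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def seed_distance(a, b):
    """Hypothetical distance: the larger of the gaps in the source and suspicious documents."""
    return max(
        interval_gap(a.src_start, a.src_end, b.src_start, b.src_end),
        interval_gap(a.susp_start, a.susp_end, b.susp_start, b.susp_end),
    )


def merge_seeds(seeds, max_gap):
    """Single-linkage merge: seeds chained by distances <= max_gap form one case."""
    parent = list(range(len(seeds)))

    def find(i):
        # Follow parent links to the cluster root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of seeds that is close enough (quadratic, fine for a sketch).
    for i in range(len(seeds)):
        for j in range(i + 1, len(seeds)):
            if seed_distance(seeds[i], seeds[j]) <= max_gap:
                parent[find(i)] = find(j)

    # Collect the members of each cluster.
    clusters = defaultdict(list)
    for i, seed in enumerate(seeds):
        clusters[find(i)].append(seed)

    # Report each cluster as the bounding span of its members in both documents.
    return sorted(
        Seed(
            min(s.src_start for s in group),
            max(s.src_end for s in group),
            min(s.susp_start for s in group),
            max(s.susp_end for s in group),
        )
        for group in clusters.values()
    )
```

Because the linkage is single, two seeds that are far apart still end up in the same case when a chain of intermediate seeds connects them; this chaining is the characteristic behavior of single-linkage clustering and the reason the choice of distance measure and threshold matters.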