Plagiarism and Collusion Detection using the Smith-Waterman Algorithm

We investigate the use of variants of the Smith-Waterman algorithm to locate similarities in texts and in program source code, with a view to their application in the detection of plagiarism and collusion. The Smith-Waterman algorithm is a classical tool for identifying and quantifying local similarities in biological sequences, but we demonstrate that somewhat different issues arise when it is applied to texts and source code, and that these differences can be exploited to yield a significant speed-up in practice. We include empirical evidence to indicate the practicality of the approach and to illustrate the efficiency gains.
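
For orientation, the following is a minimal sketch of the classical Smith-Waterman local alignment applied to two token sequences. It is the textbook dynamic programme with a linear gap penalty, not one of the faster variants investigated here. The function name `smith_waterman`, the scoring parameters (`match`, `mismatch`, `gap`) and their default values are illustrative assumptions rather than values taken from this work.

```python
from typing import List, Sequence, Tuple


def smith_waterman(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 1,      # assumed reward for equal tokens
    mismatch: int = -1,  # assumed penalty for unequal tokens
    gap: int = -1,       # assumed linear penalty per inserted or deleted token
) -> Tuple[int, Tuple[int, int], Tuple[int, int]]:
    """Return (best score, (start, end) in a, (start, end) in b).

    The index pairs are half-open ranges, so a[start:end] and b[start:end]
    are the two locally aligned regions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H: List[List[int]] = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh local alignment
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # skip a token of a
                H[i][j - 1] + gap,    # skip a token of b
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end_a, end_b = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_a), (j, end_b)


if __name__ == "__main__":
    x = "the quick brown fox".split()
    y = "a quick brown dog".split()
    score, (sa, ea), (sb, eb) = smith_waterman(x, y)
    print(score, x[sa:ea], y[sb:eb])
```

This sketch fills the whole table, so it takes time and space proportional to the product of the two sequence lengths. Treating words or lexical tokens, rather than single characters, as the alphabet is one plausible choice for texts and source code.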