Detecting Short Passages of Similar Text in Large Document Collections

This paper presents a statistical method for fingerprinting text. In a large collection of independently written documents, each text is associated with a fingerprint that should differ from all the others. If two fingerprints are too close, the two documents are suspected of containing passages of copied or similar text. Our method exploits the characteristic distribution of word trigrams, and the similarity measures are based on set-theoretic principles. The system was developed using a corpus of broadcast news reports and has been used successfully to detect plagiarism in students’ work. It can find small sections that are similar as well as those that are identical. The method is very simple and effective, but seems not to have been used before.
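
As a rough illustration of the idea, the sketch below fingerprints each text as its set of word trigrams and compares two fingerprints with set-based measures. The abstract does not specify the tokenisation or the exact measures, so everything here is an assumption: the lowercase regex tokeniser, the function names (`word_trigrams`, `resemblance`, `containment`), and the choice of intersection-over-union and intersection-over-first-set as the two measures. It should not be read as the authors' implementation.

```python
import re


def word_trigrams(text):
    """Return the set of word trigrams in `text` (assumed tokenisation:
    lowercase runs of letters, digits, and apostrophes)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def resemblance(a, b):
    """Assumed symmetric measure: shared trigrams over all trigrams."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def containment(a, b):
    """Assumed asymmetric measure: fraction of a's trigrams also in b."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


if __name__ == "__main__":
    doc_a = word_trigrams("the cat sat on the mat")
    doc_b = word_trigrams("the cat sat on the rug")
    print(resemblance(doc_a, doc_b))
    print(containment(doc_a, doc_b))
```

Because independently written texts tend to share few trigrams, a pair of documents whose score rises above a chosen threshold would be flagged for inspection. The threshold itself is a further assumption and would need tuning on a reference corpus.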
