Current research in the field of automatic plagiarism detection for text documents focuses on algorithms that compare plagiarized documents against potential original documents. Though these approaches perform well in identifying copied or even modified passages, they assume a closed world: a reference collection must be given against which a plagiarized document can be compared.
This raises the question of whether plagiarized passages within a document can be detected automatically if no reference is given, e.g., if the plagiarized passages stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism detection. This paper is devoted to this problem class; it shows that it is possible to identify potentially plagiarized passages by analyzing a single document with respect to variations in writing style.
Our contributions are fourfold: (i) a taxonomy of plagiarism offenses along with detection methods, (ii) new features for the quantification of style aspects, (iii) a publicly available plagiarism corpus for benchmark comparisons, and (iv) promising results in non-trivial plagiarism detection settings: in our experiments we achieved recall values of 85% at a precision of 75% or better.
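
The idea of analyzing a single document for variations in writing style can be illustrated with a minimal sketch. It is not the feature set or the method of the paper: the two features (average word length and average sentence length), the z-score test, the threshold of 1.5, and the function names `style_features` and `flag_outlier_passages` are all assumptions chosen for illustration only.

```python
import re
from statistics import mean, pstdev


def style_features(text):
    """Return an illustrative style vector: (avg word length, avg sentence length).

    These two features are placeholders, not the features proposed in the paper.
    """
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return (0.0, 0.0)
    avg_word_len = mean(len(w) for w in words)
    avg_sent_len = len(words) / len(sentences)
    return (avg_word_len, avg_sent_len)


def flag_outlier_passages(passages, threshold=1.5):
    """Return indices of passages whose style deviates from the document as a whole.

    A passage is flagged if, for any feature, its value lies more than
    `threshold` population standard deviations from the mean over all passages.
    The threshold is an assumed, untuned value.
    """
    feats = [style_features(p) for p in passages]
    flagged = set()
    for dim in range(2):
        values = [f[dim] for f in feats]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # no variation in this feature, nothing to flag
        for i, v in enumerate(values):
            if abs(v - mu) / sigma > threshold:
                flagged.add(i)
    return sorted(flagged)


# Five passages in a plain style, followed by one in a markedly different style.
plain = "The cat sat on the mat. The dog ran to the car."
ornate = ("Notwithstanding considerable methodological heterogeneity, "
          "contemporary investigations demonstrate remarkable "
          "epistemological convergence.")
print(flag_outlier_passages([plain] * 5 + [ornate]))  # prints [5]
```

No reference collection is consulted here: the only evidence is how far a passage departs from the style of the rest of the document. A flagged passage is therefore a candidate for closer inspection, not a confirmed case of plagiarism.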