Non-relevant document reduction in anti-plagiarism using asymmetric similarity and AVL tree index

Anti-plagiarism applications have been developed using various approaches. Many methods compare a query document against every other document, regardless of relevance. This paper proposes a method that uses asymmetric similarity to filter out non-relevant documents (those that share no topic with the query document). All documents are collected in a single corpus. Each document is preprocessed using the winnowing algorithm. The resulting winnowing features are then indexed with an AVL tree to speed up document comparison. The results show that filtering out non-relevant documents reduces processing time by a factor of almost 10 compared with the unfiltered process. Both processes achieve the same accuracy of 89.78% in identifying suspected documents.
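The pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation, and several choices are assumptions made for illustration only: the k-gram size `k`, the window size `w`, the MD5-based hash, the `threshold` value, and the definition of asymmetric similarity as the containment score |F(q) ∩ F(d)| / |F(q)|, where F denotes a document's fingerprint set. A plain Python `dict` stands in for the AVL tree index; an AVL tree would provide the same fingerprint lookup with ordered keys and O(log n) search. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def kgram_hashes(text, k=5):
    """Hash every k-gram of the normalized text (lowercase, alphanumerics only)."""
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return [
        int(hashlib.md5(s[i:i + k].encode("utf-8")).hexdigest()[:8], 16)
        for i in range(len(s) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Return the fingerprint set: the minimum hash of each window of w k-gram hashes.

    Positions are dropped here; only the selected hash values are kept.
    """
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def build_index(corpus):
    """Map each fingerprint to the ids of the documents that contain it.

    Assumption: a dict replaces the AVL tree used in the paper.
    """
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for fp in winnow(text):
            index[fp].add(doc_id)
    return index


def relevant_docs(query, index, threshold=0.1):
    """Keep documents whose containment score |F(q) & F(d)| / |F(q)| reaches threshold."""
    q = winnow(query)
    if not q:
        return {}
    shared = defaultdict(int)
    for fp in q:
        for doc_id in index.get(fp, ()):
            shared[doc_id] += 1
    return {d: n / len(q) for d, n in shared.items() if n / len(q) >= threshold}
```

Only the documents returned by `relevant_docs` would then be passed to the detailed comparison step, which is where the reported time saving would come from. The score is asymmetric because it is normalized by the query's fingerprint count rather than by the union of both documents.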
