A Survey on Natural Language Text Copy Detection

Copy detection has important applications in both intellectual property protection and information retrieval. Current work concentrates mainly on document copy detection. Early document copy detection focused mainly on detecting program plagiarism, whereas most studies now address text copy detection. This paper gives a comprehensive survey of natural language text copy detection and traces the development of the field. The approaches and features of a variety of existing text copy detection systems and prototypes are reviewed in detail. Some key detection techniques are then listed and compared with one another. Finally, future trends in text copy detection are discussed.
