Paper information: The detection of duplicates in document image databases

We propose and implement a method for detecting duplicate documents in very large image databases. The method is based on a robust "signature" extracted from each document image, which is used to index into a table of previously processed documents. The approach has a number of advantages over OCR or other recognition-based methods, including speed and robustness to imaging distortions. To justify the approach and test its scalability, we have developed a simulator that allows us to change system parameters and examine performance for millions of document signatures. A complete system is implemented and tested on a collection of technical articles and memos.
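The general idea of indexing by signature can be illustrated with a minimal sketch. The abstract does not specify how the signature is computed, so everything below is an assumption made for illustration: the `signature` function (quantized ink density per horizontal band), the `SignatureIndex` class, and the `bands` and `levels` parameters are hypothetical and are not the authors' method. Images are assumed to be binary, given as lists of rows of 0/1 pixels.

```python
from collections import defaultdict


def signature(image, bands=4, levels=4):
    """Hypothetical signature: quantized ink density of each horizontal band.

    `image` is a list of rows, each row a list of 0/1 pixels.
    Coarse quantization is what gives some tolerance to small pixel noise.
    """
    rows = len(image)
    sig = []
    for b in range(bands):
        start = b * rows // bands
        stop = (b + 1) * rows // bands
        band = image[start:stop]
        total = sum(len(row) for row in band)
        ink = sum(sum(row) for row in band)
        density = ink / total if total else 0.0
        # Map density in [0, 1] to an integer bin in [0, levels - 1].
        sig.append(min(int(density * levels), levels - 1))
    return tuple(sig)


class SignatureIndex:
    """Table mapping each signature to the documents already seen with it."""

    def __init__(self):
        self._table = defaultdict(list)

    def add(self, doc_id, image):
        """Register a document and return ids of earlier candidate duplicates."""
        sig = signature(image)
        matches = list(self._table[sig])
        self._table[sig].append(doc_id)
        return matches
```

In this sketch a lookup costs one signature computation plus one hash-table access, with no text recognition involved. An exact-match key like this one still fails when noise pushes a band density across a bin boundary, so a practical system would need a more tolerant signature or matching scheme than the one assumed here.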

David S. Doermann | Huiping Li | Omid E. Kia

[1] James O. Hamblen, et al. Computer algorithms for plagiarism detection, 1989.

[2] Udi Manber, et al. Finding Similar Files in a Large File System, 1994, USENIX Winter.

[3] Jonathan J. Hull. Document Image Matching and Retrieval With Multiple Distortion-Invariant Descriptors, 1995.

[4] Hector Garcia-Molina, et al. SCAM: A Copy Detection Mechanism for Digital Documents, 1995, DL.

[5] T. Yan. Duplicate Detection in Information Dissemination, 1995.

[6] Eiichi Tanaka, et al. High speed string edit methods using hierarchical files and hashing technique, 1988, [1988 Proceedings] 9th International Conference on Pattern Recognition.

[7] Jonathan J. Hull. Document matching on CCITT Group 4 compressed images, 1997, Electronic Imaging.

[8] Hector Garcia-Molina, et al. Copy detection mechanisms for digital documents, 1995, SIGMOD '95.