Detection of near-duplicate images for web search

Among the vast numbers of images on the web are many duplicates and near-duplicates, that is, variants derived from the same original image. Such near-duplicates appear in many web image searches and may represent infringements of copyright or indicate the presence of redundancy. While methods for identifying near-duplicates have been investigated, there has been no analysis of the kinds of alterations that are common on the web, nor any evaluation of whether real cases of near-duplication can in fact be identified. In this paper we use popular queries and a commercial image search service to collect images that we then manually analyse for instances of near-duplication. We show that such duplication is indeed significant, but that not all kinds of image alteration explored in previous literature are evident in web data. Removal of near-duplicates from a collection is impractical, but we propose that they be removed from sets of answers. We evaluate our technique for automatic identification of near-duplicates during query evaluation and show that it has promise as an effective mechanism for management of near-duplication in practice.
