Beyond “Near Duplicates”: Learning Hash Codes for Efficient Similar-Image Retrieval

Finding similar images in a large database is an important but often computationally expensive task. In this paper, we present a two-tier similar-image retrieval system with the efficiency characteristics found in simpler systems designed to recognize near-duplicates. We compare the efficiency of lookups based on random projections and on learned hashes with that of exemplar sampling performed 100 times more frequently. Both hash-based approaches significantly improve on the results from exemplar sampling, despite having significantly lower computational costs. Learned-hash keys provide the best results in terms of both recall and efficiency.
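
To make the random-projection lookup concrete, the following is a minimal sketch of a generic sign-random-projection hash table with an exact re-ranking step, which loosely mirrors a two-tier retrieval (cheap hash lookup first, exact comparison second). It is not the system evaluated in this paper: the feature vectors are random stand-ins for image descriptors, and the function names (`make_projections`, `hash_key`, `build_index`, `query`), the 64-dimensional features, and the 8-bit key length are all assumptions chosen for illustration.

```python
import numpy as np
from collections import defaultdict

def make_projections(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (assumed Gaussian) in a dim-dimensional space."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))

def hash_key(planes, x):
    """Pack the sign of each projection of x into one integer key."""
    bits = (planes @ x) > 0
    key = 0
    for b in bits:
        key = (key << 1) | int(b)
    return key

def build_index(planes, vectors):
    """First tier: map each hash key to the indices of vectors that share it."""
    table = defaultdict(list)
    for i, v in enumerate(vectors):
        table[hash_key(planes, v)].append(i)
    return table

def query(planes, table, vectors, q, k=5):
    """Second tier: re-rank the candidates in q's bucket by exact Euclidean distance."""
    candidates = table.get(hash_key(planes, q), [])
    ranked = sorted(candidates, key=lambda i: float(np.linalg.norm(vectors[i] - q)))
    return ranked[:k]

# Hypothetical data: 1000 random 64-dimensional vectors standing in for image features.
rng = np.random.default_rng(1)
vectors = rng.standard_normal((1000, 64))
planes = make_projections(64, 8)
table = build_index(planes, vectors)

# Querying with a stored vector returns that vector first, since its distance is zero.
result = query(planes, table, vectors, vectors[42], k=3)
```

Only the vectors in the matching bucket are compared exactly, which is where the savings over scanning the whole database would come from. A learned hash would replace `hash_key` with a trained function while leaving the table lookup unchanged.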