MapReduce-based clustering for near-duplicate image identification

In this paper, we develop an effective algorithm for near-duplicate image identification in large-scale image sets. The LLC (locality-constrained linear coding) method is integrated with the maxIDF cut model to obtain more discriminative image representations. The MapReduce framework is used for image clustering and pairwise merging, so that near duplicates can be identified effectively at scale. An intuitive strategy is also introduced to guide parameter selection. Experimental results on large-scale image sets show that the algorithm achieves significant improvements in both accuracy and computational efficiency compared with baseline methods.
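The abstract does not give implementation details, so the following is only a toy, single-process sketch of the general pattern it names: a map step that groups images into candidate buckets, a reduce step that compares pairs within each bucket, and a pairwise merge of the matching pairs into clusters. Everything specific here is an assumption made for illustration: images are represented as sets of hypothetical visual-word ids, the bucket key (smallest word id) is a stand-in and is not the paper's maxIDF cut, Jaccard similarity replaces the LLC-based representation, and the threshold of 0.5 is arbitrary.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: image id -> set of visual-word ids (an assumption).
SAMPLE = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {7, 8, 9},
    "d": {7, 8, 9, 10},
    "e": {20, 21},
}


def jaccard(x, y):
    """Jaccard similarity of two sets (stand-in for the paper's measure)."""
    return len(x & y) / len(x | y)


def map_phase(images):
    """Map: emit (bucket key, image id). The key is a stand-in, not maxIDF cut."""
    for image_id, words in images.items():
        yield min(words), image_id


def shuffle(pairs):
    """Group mapped values by key, as a MapReduce runtime would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, images, threshold):
    """Reduce: within each bucket, emit pairs whose similarity passes the threshold."""
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            if jaccard(images[a], images[b]) >= threshold:
                yield a, b


def merge_pairs(ids, pairs):
    """Pairwise merging: union-find over the matching pairs."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return sorted(sorted(c) for c in clusters.values())


def find_near_duplicates(images, threshold=0.5):
    """Run map, shuffle, reduce, and merge in sequence on one machine."""
    groups = shuffle(map_phase(images))
    pairs = reduce_phase(groups, images, threshold)
    return merge_pairs(images.keys(), pairs)


if __name__ == "__main__":
    print(find_near_duplicates(SAMPLE))
```

Bucketing before comparison is what keeps the pairwise step tractable: only images sharing a bucket are compared, so the cost depends on bucket sizes rather than on the square of the collection size. In a real MapReduce deployment, the map and reduce functions would run across many workers and the shuffle would be handled by the framework.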
