Constrained keypoint quantization: towards better bag-of-words model for large-scale multimedia retrieval

Bag-of-words models are among the most widely used and successful representations in multimedia retrieval. However, the quantization error introduced when mapping keypoints to visual words is one of the model's main drawbacks. Although techniques such as soft assignment to bags [23] and query expansion [27] have been introduced to deal with the problem, their performance gain always comes at the cost of longer query response time, which makes them difficult to apply to large-scale multimedia retrieval. In this paper, we propose a simple "constrained keypoint quantization" method that effectively reduces the overall quantization error of the bag-of-words representation and greatly improves retrieval efficiency at the same time. The central idea is that if a keypoint is far away from all visual words, we simply remove it. At first glance, this strategy seems naive and dangerous. However, we show that the proposed method has a solid theoretical foundation. Our experimental results on three widely used datasets for near-duplicate image and video retrieval confirm that by removing a large number of keypoints with high quantization error, we obtain comparable or even better retrieval performance while dramatically boosting retrieval efficiency.
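The central idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `constrained_quantize`, the fixed threshold `max_dist`, the Euclidean distance, and the brute-force nearest-word search are all assumptions made for illustration, since the abstract does not state how "far away" is defined or chosen.

```python
import numpy as np

def constrained_quantize(keypoints, codebook, max_dist):
    """Map keypoints to nearest visual words, dropping distant keypoints.

    keypoints: (n, d) array of descriptors.
    codebook:  (k, d) array of visual words (cluster centres).
    max_dist:  hypothetical cut-off; a keypoint whose nearest visual
               word is farther than this is removed (assumed criterion).
    Returns the bag-of-words histogram and a boolean mask of kept keypoints.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    codebook = np.asarray(codebook, dtype=float)

    # Pairwise Euclidean distances, shape (n, k), via broadcasting.
    diffs = keypoints[:, None, :] - codebook[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Nearest visual word for every keypoint and its quantization error.
    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(keypoints)), nearest]

    # Constrained step: keep only keypoints close to some visual word.
    keep = nearest_dist <= max_dist

    # Histogram over visual words, built from the kept keypoints only.
    hist = np.bincount(nearest[keep], minlength=len(codebook))
    return hist, keep


# Toy usage: two visual words in 2-D and four keypoints.
codebook = [[0.0, 0.0], [10.0, 0.0]]
keypoints = [[0.5, 0.0], [9.0, 0.0], [5.0, 5.0], [0.0, 1.0]]
hist, keep = constrained_quantize(keypoints, codebook, max_dist=2.0)
```

In this toy example the third keypoint lies about 7.07 from both visual words, so it is discarded and contributes to neither bin. Fewer retained keypoints also mean fewer index entries to score at query time, which is the efficiency effect the abstract describes.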

[1]  Shigeo Abe.  Pattern Classification , 2001, Springer London.

[2]  Song Tan,et al.  Large-scale near-duplicate web video search: Challenge and opportunity , 2009, 2009 IEEE International Conference on Multimedia and Expo.

[3]  Yan Ke,et al.  An efficient parts-based near-duplicate and sub-image retrieval system , 2004, MULTIMEDIA '04.

[4]  Cordelia Schmid,et al.  Improving Bag-of-Features for Large Scale Image Search , 2010, International Journal of Computer Vision.

[5]  Rong Jin,et al.  An efficient key point quantization algorithm for large scale image retrieval , 2009, LS-MMRM '09.

[6]  Xian-Sheng Hua,et al.  Video-based image retrieval , 2011, MM '11.

[7]  Yan Ke,et al.  Efficient Near-duplicate Detection and Sub-image Retrieval , 2004 .

[8]  Xian-Sheng Hua,et al.  Object Retrieval Using Visual Query Context , 2011, IEEE Transactions on Multimedia.

[9]  Chong-Wah Ngo,et al.  Practical elimination of near-duplicates from web video search , 2007, ACM Multimedia.

[10]  W. Hoeffding.  Probability Inequalities for Sums of Bounded Random Variables , 1963 .

[11]  Jean-Philippe Tarel,et al.  Non-Mercer Kernels for SVM Object Recognition , 2004, BMVC.

[12]  Frédéric Jurie,et al.  Creating efficient codebooks for visual recognition , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[13]  Trevor Darrell,et al.  The Pyramid Match Kernel: Efficient Learning with Sets of Features , 2007, J. Mach. Learn. Res.

[14]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2004, ECCV 2004.

[15]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Michael Isard,et al.  Lost in quantization: Improving particular object retrieval in large scale image databases , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Yi Yang,et al.  A Multimedia Retrieval Framework Based on Semi-Supervised Ranking and Relevance Feedback , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  David G. Lowe,et al.  Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration , 2009, VISAPP.

[19]  Fei Wang,et al.  Million-scale near-duplicate video retrieval system , 2011, ACM Multimedia.

[20]  Siwei Lyu,et al.  Mercer kernels for object recognition with local features , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[21]  Cordelia Schmid,et al.  Vector Quantizing Feature Space with a Regular Lattice , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[22]  David G. Lowe.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[23]  Yi Yang,et al.  Harmonizing Hierarchical Manifolds for Multimedia Document Semantics Understanding and Cross-Media Retrieval , 2008, IEEE Transactions on Multimedia.

[24]  Xian-Sheng Hua,et al.  Large-scale robust visual codebook construction , 2010, ACM Multimedia.

[25]  Eli Shechtman,et al.  In defense of Nearest-Neighbor based image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[27]  David Nistér,et al.  Scalable Recognition with a Vocabulary Tree , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).