Scalar quantization for large scale image search

The Bag-of-Words (BoW) model based on SIFT features has been widely used in large-scale image retrieval applications. Feature quantization plays a crucial role in the BoW model: it generates visual words from high-dimensional SIFT features so that they fit the inverted file structure used for indexing. Traditional feature quantization approaches suffer from several problems: 1) high computational cost: visual word generation (codebook construction) is time consuming, especially with large numbers of features; 2) limited reliability: different image collections may produce very different codebooks, and the quantization error is hard to control; 3) update inefficiency: once the codebook is constructed, it is not easy to update. In this paper, a novel feature quantization algorithm, scalar quantization, is proposed. With scalar quantization, a SIFT feature is quantized into a descriptive and discriminative bit vector, and the first few tens of bits are taken as the code word. Our quantizer is independent of any particular image collection. In addition, the result of scalar quantization fits naturally into the classic inverted file structure for image indexing. Moreover, the quantization error can be flexibly reduced and controlled by efficiently enumerating the nearest neighbors of code words. The performance of scalar quantization has been evaluated on partial-duplicate Web image search over a database of one million images. Experiments show that the proposed scalar quantization achieves a relative improvement of 42% in mean average precision over the baseline hierarchical visual vocabulary tree approach, and also outperforms the state-of-the-art Hamming Embedding approach and the soft assignment method.
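
The pipeline described above (binarize a descriptor, take the leading bits as an index key, then enumerate nearby code words to control quantization error) can be illustrated with a small sketch. The Python snippet below is a minimal illustration under stated assumptions, not the authors' exact scheme: it assumes each SIFT dimension is binarized against the descriptor's own median, that the first CODE_BITS bits serve as the inverted-file key, and that quantization error is relaxed by visiting code words within a small Hamming radius; the threshold rule, bit ordering, and the value of CODE_BITS are illustrative choices.

```python
import itertools
import numpy as np

CODE_BITS = 32  # number of leading bits used as the code word (illustrative choice)


def scalar_quantize(sift_descriptor):
    """Binarize a 128-D SIFT descriptor into a bit vector.

    Illustrative rule: a bit is 1 when the dimension exceeds the descriptor's
    own median (the paper's actual thresholding may differ).
    """
    d = np.asarray(sift_descriptor, dtype=np.float32)
    threshold = np.median(d)
    return (d > threshold).astype(np.uint8)  # 128 bits, one per dimension


def code_word(bits):
    """Pack the first CODE_BITS bits into an integer key for the inverted file."""
    key = 0
    for b in bits[:CODE_BITS]:
        key = (key << 1) | int(b)
    return key


def neighbor_code_words(key, radius=1):
    """Enumerate code words within a given Hamming radius of `key`.

    Probing these neighboring entries of the inverted index is one way to
    trade lookup cost against quantization error.
    """
    yield key
    for r in range(1, radius + 1):
        for positions in itertools.combinations(range(CODE_BITS), r):
            flipped = key
            for p in positions:
                flipped ^= 1 << p
            yield flipped


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    descriptor = rng.integers(0, 256, size=128)  # stand-in for a real SIFT descriptor
    bits = scalar_quantize(descriptor)
    key = code_word(bits)
    print(f"code word: {key:08x}")
    print("entries probed at Hamming radius 1:",
          sum(1 for _ in neighbor_code_words(key, radius=1)))
```

Because the bits are derived from each descriptor independently, this kind of quantizer needs no training collection and new images can be indexed without rebuilding a codebook, which is the property the abstract emphasizes.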
