On Vocabulary Size in Bag-of-Visual-Words Representation

Bag-of-visual-words is a popular image representation that offers high matching accuracy and efficiency. Although vocabulary size affects matching accuracy, existing research usually selects it empirically. Research on representative local descriptors shows that, with similarity-based clustering, the degree of intra-cluster similarity among descriptors plays the same role in straightforward matching as vocabulary size does in visual-word matching. Based on this observation, we propose using similarity-based clustering to determine the optimal vocabulary size for a given dataset in visual-word matching. Preliminary experiments on three datasets produce encouraging results and demonstrate the potential of the proposed approach.
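To illustrate the link between a similarity threshold and vocabulary size, the following is a minimal sketch, not the method evaluated in the paper. It assumes a simple leader-style clustering with cosine similarity, where a descriptor joins the most similar existing cluster if the similarity reaches a threshold and otherwise starts a new cluster. The function names, the threshold parameter `min_similarity`, and the toy 2-D descriptors are all hypothetical; real local descriptors such as SIFT have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def leader_clustering(descriptors, min_similarity):
    """Assumed stand-in for similarity-based clustering.

    Each descriptor is assigned to the most similar existing leader if that
    similarity is at least min_similarity; otherwise it becomes a new leader.
    The number of leaders is the resulting vocabulary size.
    """
    leaders = []
    labels = []
    for d in descriptors:
        best_index, best_sim = -1, -1.0
        for i, leader in enumerate(leaders):
            sim = cosine_similarity(d, leader)
            if sim > best_sim:
                best_index, best_sim = i, sim
        if best_index >= 0 and best_sim >= min_similarity:
            labels.append(best_index)
        else:
            leaders.append(d)
            labels.append(len(leaders) - 1)
    return leaders, labels


# Toy descriptors (hypothetical data, for illustration only).
descriptors = [
    [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7],
]

# A stricter similarity threshold yields more clusters, i.e. a larger vocabulary.
vocabulary_sizes = {}
for threshold in (0.5, 0.9, 0.999):
    leaders, _ = leader_clustering(descriptors, threshold)
    vocabulary_sizes[threshold] = len(leaders)
    print(f"min_similarity={threshold}: vocabulary size = {len(leaders)}")
```

In this sketch the vocabulary size is not chosen in advance; it follows from the required intra-cluster similarity, which is the relationship the abstract relies on. How the threshold itself should be chosen for a given dataset is not specified here and is left open.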
