Efficient Kernels Couple Visual Words Through Categorical Opponency

Recent progress has been made on sparse dictionaries for the Bag-of-Visual-Words (BOVW) approach to object recognition and scene categorization. In particular, jointly encoded words have been shown to substantially improve retrieval and categorization performance, both by increasing dictionary sparsity, which benefits retrieval efficiency, and by improving categorization selectivity. In this paper, we propose and evaluate several functions for the “soft pairing” of words, whereby the likelihood of pairing is influenced by the proximity and scale of putative word pairs. The methods are evaluated on the Caltech-101 database and on the Pascal VOC 2007 and 2011 databases. Results are compared against spatial pyramids using BOVW descriptions, against standard BOVW approaches, and across different parameter values of the pairing functions. We also compare dense and keypoint-based approaches in this context. One conclusion is that word pairing provides a means of attaining the performance of much larger dictionaries without the computational cost of clustering, which suits it to situations where dictionaries must be frequently relearned or where image statistics change often.
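The abstract does not specify the exact soft-pairing functions, so the following is only a minimal illustrative sketch, assuming a Gaussian fall-off in scale-normalized distance and in log scale ratio; the function and parameter names (`soft_pair_weight`, `sigma_d`, `sigma_s`) are hypothetical, not the paper's.

```python
import numpy as np

def soft_pair_weight(p_i, p_j, s_i, s_j, sigma_d=2.0, sigma_s=1.0):
    """Hypothetical soft-pairing weight: the likelihood of pairing two
    visual words decays with their scale-normalized spatial distance and
    with the disparity between their detection scales."""
    d = np.linalg.norm(np.asarray(p_i, float) - np.asarray(p_j, float)) / max(s_i, s_j)
    log_ratio = np.log(s_i / s_j)
    return np.exp(-d**2 / (2 * sigma_d**2)) * np.exp(-log_ratio**2 / (2 * sigma_s**2))

def paired_word_histogram(words, positions, scales, vocab_size):
    """Accumulate a soft histogram over ordered word pairs (w_i, w_j),
    weighting each occurrence by its soft-pairing likelihood."""
    hist = np.zeros((vocab_size, vocab_size))
    n = len(words)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = soft_pair_weight(positions[i], positions[j], scales[i], scales[j])
            hist[words[i], words[j]] += w
    return hist.ravel()  # flattened joint-word descriptor
```

Note the descriptor has `vocab_size**2` bins, which is why soft pairing can emulate a much larger dictionary without reclustering: the pairwise structure, not a bigger vocabulary, supplies the extra selectivity.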
