Discovery of Collocation Patterns: from Visual Words to Visual Phrases

A visual word lexicon can be constructed by clustering primitive visual features, and a visual object can be described by a set of visual words. Such a "bag-of-words" representation has led to many significant results in various vision tasks including object recognition and categorization. However, in practice, the clustering of primitive visual features tends to result in synonymous visual words that over-represent visual patterns, as well as polysemous visual words that bring large uncertainties and ambiguities in the representation. This paper aims at generating a higher-level lexicon, i.e. visual phrase lexicon, where a visual phrase is a meaningful spatially co-occurrent pattern of visual words. This higher-level lexicon is much less ambiguous than the lower-level one. The contributions of this paper include: (1) a fast and principled solution to the discovery of significant spatial co-occurrent patterns using frequent itemset mining; (2) a pattern summarization method that deals with the compositional uncertainties in visual phrases; and (3) a top-down refinement scheme of the visual word lexicon by feeding back discovered phrases to tune the similarity measure through metric learning.

[1]  Sanja Fidler,et al.  Hierarchical Statistical Learning of Generic Parts of Object Structure , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[2]  Antonio Criminisi,et al.  Object categorization by learned universal visual dictionary , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[3]  Cordelia Schmid,et al.  A Discriminative Framework for Texture and Object Recognition Using Local Image Features , 2006, Toward Category-Level Object Recognition.

[4]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[5]  Song-Chun Zhu,et al.  What are Textons? , 2005 .

[6]  Cordelia Schmid,et al.  A Comparison of Affine Region Detectors , 2005, International Journal of Computer Vision.

[7]  Luc Van Gool,et al.  Video mining with frequent itemset configurations , 2006 .

[8]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[9]  Hui Xiong,et al.  Discovering colocation patterns from spatial data sets: a general approach , 2004, IEEE Transactions on Knowledge and Data Engineering.

[10]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[11]  Jiawei Han,et al.  Summarizing itemset patterns: a profile-based approach , 2005, KDD '05.

[12]  Song-Chun Zhu,et al.  What are Textons? , 2005, Int. J. Comput. Vis..

[13]  Gösta Grahne,et al.  Fast algorithms for frequent itemset mining using FP-trees , 2005, IEEE Transactions on Knowledge and Data Engineering.

[14]  Narendra Ahuja,et al.  Extracting Subimages of an Unknown Category from a Set of Images , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[15]  Jiawei Han,et al.  Generating semantic annotations for frequent patterns with context analysis , 2006, KDD '06.

[16]  Gang Wang,et al.  Using Dependent Regions for Object Categorization in a Generative Framework , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[17]  Sven J. Dickinson,et al.  Using Language to Drive the Perceptual Grouping of Local Image Features , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[18]  Ankur Agarwal,et al.  Hyperfeatures - Multilevel Local Coding for Visual Recognition , 2006, ECCV.

[19]  Alexei A. Efros,et al.  Using Multiple Segmentations to Discover Objects and their Extent in Image Collections , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[20]  Antonio Torralba,et al.  Learning hierarchical models of scenes, objects, and parts , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[21]  Andrew Zisserman,et al.  Video data mining using configurations of viewpoint invariant regions , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[22]  Gustavo Carneiro,et al.  Sparse Flexible Models of Local Features , 2006, ECCV.

[23]  Yan Ke,et al.  PCA-SIFT: a more distinctive representation for local image descriptors , 2004, CVPR 2004.

[24]  Jiawei Han,et al.  Frequent pattern mining: current status and future directions , 2007, Data Mining and Knowledge Discovery.

[25]  Geoffrey E. Hinton,et al.  Neighbourhood Components Analysis , 2004, NIPS.