Visual Dictionary Learning for Joint Object Categorization and Segmentation

Representing objects using elements from a visual dictionary is widely used in object detection and categorization. Prior work on dictionary learning has shown improvements in the accuracy of object detection and categorization by learning discriminative dictionaries. However none of these dictionaries are learnt for joint object categorization and segmentation. Moreover, dictionary learning is often done separately from classifier training, which reduces the discriminative power of the model. In this paper, we formulate the semantic segmentation problem as a joint categorization, segmentation and dictionary learning problem. To that end, we propose a latent conditional random field (CRF) model in which the observed variables are pixel category labels and the latent variables are visual word assignments. The CRF energy consists of a bottom-up segmentation cost, a top-down bag of (latent) words categorization cost, and a dictionary learning cost. Together, these costs capture relationships between image features and visual words, relationships between visual words and object categories, and spatial relationships among visual words. The segmentation, categorization, and dictionary learning parameters are learnt jointly using latent structural SVMs, and the segmentation and visual words assignments are inferred jointly using energy minimization techniques. Experiments on the Graz02 and CamVid datasets demonstrate the performance of our approach.

[1]  Thorsten Joachims,et al.  Learning structural SVMs with latent variables , 2009, ICML '09.

[2]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[3]  René Vidal,et al.  Using global bag of features models in random fields for joint categorization and segmentation of objects , 2011, CVPR 2011.

[4]  Andrew J. Davison,et al.  Active Matching , 2008, ECCV.

[5]  Jianxiong Xiao,et al.  Supervised Label Transfer for Semantic Segmentation of Street Scenes , 2010, ECCV.

[6]  Stefano Soatto,et al.  Quick Shift and Kernel Methods for Mode Seeking , 2008, ECCV.

[7]  Bill Triggs,et al.  Scene Segmentation with CRFs Learned from Partially Labeled Images , 2007, NIPS.

[8]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[9]  Ming-Hsuan Yang,et al.  Top-down visual saliency via joint CRF and dictionary learning , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Jamie Shotton,et al.  The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[11]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[12]  Pushmeet Kohli,et al.  Associative hierarchical CRFs for object class image segmentation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[13]  Serge J. Belongie,et al.  Object categorization using co-occurrence, location and appearance , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Stefano Soatto,et al.  Class segmentation and object localization with superpixel neighborhoods , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[15]  Roberto Cipolla,et al.  Segmentation and Recognition Using Structure from Motion Point Clouds , 2008, ECCV.

[16]  Rong Jin,et al.  Unifying discriminative visual codebook generation with classifier training for object category recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Roberto Cipolla,et al.  Semantic texton forests for image categorization and segmentation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[19]  Bastian Leibe,et al.  Joint 2D-3D temporally consistent semantic segmentation of street scenes , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Antonio Criminisi,et al.  Object categorization by learned universal visual dictionary , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[21]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[22]  KohliPushmeet,et al.  Robust Higher Order Potentials for Enforcing Label Consistency , 2009 .

[23]  Jana Kosecka,et al.  Semantic segmentation of street scenes by superpixel co-occurrence and 3D geometry , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[24]  Stephen Gould,et al.  Region-based Segmentation and Object Detection , 2009, NIPS.

[25]  Pushmeet Kohli,et al.  Robust Higher Order Potentials for Enforcing Label Consistency , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Pushmeet Kohli,et al.  Graph Cut Based Inference with Co-occurrence Statistics , 2010, ECCV.

[27]  Frédéric Jurie,et al.  Combining appearance models and Markov Random Fields for category level object segmentation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Andrew Zisserman,et al.  Pylon Model for Semantic Segmentation , 2011, NIPS.

[29]  Vladimir Kolmogorov,et al.  What energy functions can be minimized via graph cuts? , 2002, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Thomas Deselaers,et al.  ClassCut for Unsupervised Class Segmentation , 2010, ECCV.

[31]  Michael I. Jordan,et al.  Loopy Belief Propagation for Approximate Inference: An Empirical Study , 1999, UAI.