A Discriminative Latent Model of Image Region and Object Tag Correspondence

We propose a discriminative latent model for annotating images with unaligned object-level textual annotations. Instead of relying on the bag-of-words image representation currently popular in the computer vision community, our model explicitly captures the more intricate relationships between visual and textual information. In particular, we model the mapping from image regions to annotations, which lets us relate each image region to its corresponding annotation term. We also model the overall scene label as latent information, which allows us to cluster test images. Our training data consist of images and their associated annotations; neither the ground-truth region-to-annotation mapping nor the overall scene label is available. We therefore develop a novel variant of the latent SVM framework that treats both as latent variables. Our experimental results demonstrate the effectiveness of the proposed model compared with other baseline methods.
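To make the role of the two latent variables concrete, the following is a minimal illustrative sketch, not the model described above. It assumes a toy linear score that decomposes into a per-region appearance term and a scene-tag compatibility term; the names `infer_latent`, `region_w`, and `scene_w`, the feature vectors, and all weights are hypothetical. Under that assumption, the latent region-to-annotation mapping and the latent scene label can each be filled in by a simple maximization.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def infer_latent(regions, tags, region_w, scene_w):
    """Fill in the latent variables by maximizing a toy linear score.

    regions  : list of feature vectors, one per image region
    tags     : annotation terms attached to the image (unaligned)
    region_w : dict tag -> weight vector (hypothetical appearance weights)
    scene_w  : dict scene -> dict tag -> float (hypothetical compatibility)
    Returns (score, scene, mapping), where mapping[j] is the tag chosen
    for region j.
    """
    # Latent mapping: the assumed score decomposes over regions, so each
    # region independently picks its best-scoring tag.
    mapping = []
    match_score = 0.0
    for x in regions:
        best_tag = max(tags, key=lambda t: dot(region_w[t], x))
        mapping.append(best_tag)
        match_score += dot(region_w[best_tag], x)

    # Latent scene label: pick the scene most compatible with the tags.
    def scene_score(s):
        return sum(scene_w[s].get(t, 0.0) for t in tags)

    best_scene = max(scene_w, key=scene_score)
    return match_score + scene_score(best_scene), best_scene, mapping


# Toy example with two regions and two unaligned tags (made-up numbers).
regions = [[1.0, 0.0], [0.0, 1.0]]
tags = ["sky", "grass"]
region_w = {"sky": [2.0, -1.0], "grass": [-1.0, 2.0]}
scene_w = {
    "outdoor": {"sky": 1.0, "grass": 0.5},
    "indoor": {"sky": -1.0, "grass": -0.5},
}
score, scene, mapping = infer_latent(regions, tags, region_w, scene_w)
print(score, scene, mapping)
```

In a latent SVM setting, a maximization of this kind would typically be run inside the training loop to complete the unobserved variables before updating the weights; the exact potentials and learning procedure of the proposed model are not specified here and may differ.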
