Image Tagging via Cross-Modal Semantic Mapping

Images without annotations are ubiquitous on the Internet, and recommending tags for them remains a challenging open task in image understanding. A common bottleneck in related work is the semantic gap between image and text representations. In this paper, we bridge this gap by introducing a semantic layer: a word-embedding space in which image tags are represented as word vectors. Our model first learns an optimal mapping from the visual space to the semantic space from training data. We then annotate test images by decoding the semantic representations of their visual features. Extensive experiments demonstrate that our model outperforms state-of-the-art approaches in predicting image tags.
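The two-stage pipeline described above (learn a visual-to-semantic mapping, then decode the mapped features into tags) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's actual formulation: the mapping is taken to be linear and fit by ridge regression, each training image's semantic target is taken to be the mean word vector of its tags, and decoding is taken to be a cosine-similarity nearest-neighbor search over the tag vocabulary. The function names (fit_linear_map, image_targets, predict_tags) are hypothetical.

```python
import numpy as np

def image_targets(tag_matrix, tag_vecs):
    """Semantic target per image: mean word vector of its tags (an assumption).

    tag_matrix: (n_images, n_tags) binary indicator matrix.
    tag_vecs:   (n_tags, embed_dim) word embeddings of the tag vocabulary.
    """
    counts = tag_matrix.sum(axis=1, keepdims=True)
    return (tag_matrix @ tag_vecs) / np.maximum(counts, 1)

def fit_linear_map(X, Y, reg=1e-2):
    """Ridge regression stand-in for the learned mapping.

    Returns W minimizing ||X W - Y||^2 + reg * ||W||^2, where
    X is (n_images, visual_dim) and Y is (n_images, embed_dim).
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)

def predict_tags(X, W, tag_vecs, k=2):
    """Map visual features into the semantic space and return, for each
    image, the indices of the k tags with the highest cosine similarity."""
    S = X @ W
    S = S / (np.linalg.norm(S, axis=1, keepdims=True) + 1e-12)
    T = tag_vecs / (np.linalg.norm(tag_vecs, axis=1, keepdims=True) + 1e-12)
    sims = S @ T.T
    return np.argsort(-sims, axis=1)[:, :k]

# Toy data: 3 images, 3 tags, one tag per image, one-hot "embeddings".
X_train = np.eye(3)
tags_train = np.eye(3)
vocab_vecs = np.eye(3)

W = fit_linear_map(X_train, image_targets(tags_train, vocab_vecs), reg=1e-3)
top1 = predict_tags(X_train, W, vocab_vecs, k=1)
print(top1.ravel())
```

In practice the visual features would come from an image descriptor and the tag vectors from pretrained word embeddings; the one-hot toy data here only checks that the mapping and decoding steps fit together.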
