A Structured Listwise Approach to Learning to Rank for Image Tagging

With the growing quantity and diversity of publicly available image data, computer vision plays an increasingly central role in understanding and organizing visual information. Image tagging models are commonly used to make this data accessible and useful, yet generating image labels and ranking them by their relevance to the visual content remains an open problem. In this work, we use a bilinear compatibility function, inspired by zero-shot learning, that allows us to rank tags according to their relevance to the image content, and we propose a novel listwise structured loss formulation to learn it from data. We further leverage captioned image data and propose several "tags from captions" schemes designed to capture user attention and intra-user agreement in a simple and effective manner. We evaluate our method on the COCO-Captions, PASCAL-sentences and MIRFlickr-25k datasets, showing promising results.
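To make the bilinear compatibility idea concrete, the sketch below scores each candidate tag against an image as score(x, y) = xᵀWy, where x is an image feature, y a tag (word) embedding, and W a learned matrix, then ranks tags by that score. This is a minimal illustration under assumed dimensions and random placeholder data, not the paper's actual implementation; the function and variable names are hypothetical.

```python
import numpy as np

def compatibility_scores(image_feat, tag_embeddings, W):
    """Bilinear compatibility: score_i = image_feat^T . W . tag_embeddings[i]."""
    # (n_tags, d_tag) @ (d_tag, d_img) @ (d_img,) -> (n_tags,)
    return tag_embeddings @ W.T @ image_feat

def rank_tags(image_feat, tags, tag_embeddings, W):
    """Return tags sorted by decreasing compatibility with the image."""
    scores = compatibility_scores(image_feat, tag_embeddings, W)
    return [tags[i] for i in np.argsort(-scores)]

# Toy example with random features (placeholders for CNN image features
# and word embeddings of the candidate tags).
rng = np.random.default_rng(0)
d_img, d_tag = 8, 5
W = rng.standard_normal((d_img, d_tag))      # learned bilinear map
x = rng.standard_normal(d_img)               # image feature vector
tags = ["dog", "beach", "car", "tree"]
E = rng.standard_normal((len(tags), d_tag))  # one embedding per tag

print(rank_tags(x, tags, E, W))
```

In practice W would be trained with the listwise loss over relevance-ordered tag lists, rather than sampled randomly as here.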
