Vision-language integration using constrained local semantic features

Abstract: This paper addresses two promising recent topics in computer vision, namely the integration of linguistic and visual information, and the use of semantic features to represent image content. Semantic features represent an image in terms of visual concepts that a set of base classifiers detects in it. Recent work reports competitive performance in image classification and retrieval using such features. We propose to rely on this type of image description to facilitate its integration with linguistic data. More precisely, the contribution of this paper is threefold. First, we propose to automatically determine the most useful dimensions of a semantic representation according to the actual image content. This results in a level of sparsity for the semantic features that is adapted to each image independently. Our model takes into account both the confidence in each base classifier and the global amount of information in the semantic signature, defined in the Shannon sense. This contribution is further extended to better reflect the detection of a visual concept at a local scale. Second, we introduce a new strategy for learning an efficient mid-level representation with CNNs that boosts the performance of semantic signatures. Last, we propose several schemes to integrate a visual representation based on semantic features with linguistic information, leading to the nesting of linguistic information at two levels of the visual features. Experimental validation is conducted on four benchmarks (VOC 2007, VOC 2012, NUS-WIDE and MIT Indoor) for classification, three of them for retrieval and two of them for bi-modal classification. The proposed semantic feature achieves state-of-the-art performance on three classification benchmarks and on all retrieval benchmarks. Our vision-language integration method achieves state-of-the-art performance in bi-modal classification.
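The idea of a per-image sparsity level driven by Shannon information can be illustrated with a minimal sketch. It is not the paper's actual model, which the abstract does not specify: the function names, the normalization of scores into a distribution, and the rule that keeps round(2^H) concepts (the perplexity of the signature) are all assumptions made for illustration, and the per-classifier confidence term mentioned in the abstract is left out.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sparsify_signature(scores):
    """Zero out all but the k strongest concept scores of one image.

    ASSUMPTION (illustrative only): k = round(2 ** H), where H is the
    entropy of the scores normalized to sum to 1. A peaked signature
    keeps few concepts; a flat one keeps many.
    """
    if not scores:
        return []
    clipped = [max(s, 0.0) for s in scores]  # ignore negative classifier outputs
    total = sum(clipped)
    if total == 0:
        return [0.0] * len(scores)
    probs = [s / total for s in clipped]
    h = shannon_entropy(probs)
    k = max(1, min(len(scores), round(2 ** h)))
    # Indices of the k largest scores (ties resolved by original order).
    order = sorted(range(len(scores)), key=lambda i: clipped[i], reverse=True)
    keep = set(order[:k])
    return [clipped[i] if i in keep else 0.0 for i in range(len(scores))]


if __name__ == "__main__":
    print(sparsify_signature([0.97, 0.01, 0.01, 0.01]))  # one dominant concept
    print(sparsify_signature([0.25, 0.25, 0.25, 0.25]))  # no dominant concept
```

Under these assumptions, an image dominated by one concept keeps a single non-zero dimension, while an image whose scores are spread evenly keeps all of them, so the sparsity level follows the content of each image rather than a fixed global threshold.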
