Cross-media Retrieval by Learning Rich Semantic Embeddings of Multimedia

Cross-media retrieval aims to discover semantic associations between different media types. Most existing methods focus on learning mapping functions or finding optimal common spaces, but neglect how people actually perceive images and texts. This paper proposes a brain-inspired cross-media retrieval framework that learns rich semantic embeddings of multimedia. Instead of directly using off-the-shelf image features, we combine the visual and descriptive senses of an image, from the perspective of human perception, via a joint model called the multi-sensory fusion network (MSFN). A topic-model-based TextNet maps texts into the same semantic space as images according to their shared ground-truth labels. Moreover, to overcome the limitations of insufficient data for training neural networks and of overly simple text forms, we introduce a large-scale image-text dataset, the Britannica dataset. Extensive experiments demonstrate the effectiveness of our framework for texts of different lengths on three benchmark datasets as well as the Britannica dataset. Most notably, we report the best known average Img2Text and Text2Img results compared with several state-of-the-art methods.
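
To make the architecture concrete, below is a minimal sketch, not the authors' code, of the two components the abstract describes: an MSFN-style joint model that fuses visual features with descriptive features and projects them into a shared semantic space supervised by shared labels, and a TextNet that embeds an LDA topic distribution into the same space. All layer sizes, dimensions, and names (visual_branch, desc_branch, fusion) are illustrative assumptions.

# Minimal sketch (assumed architecture, not the published implementation).
import torch
import torch.nn as nn

class MSFN(nn.Module):
    """Fuses a visual feature vector and a descriptive feature vector
    for an image into one rich semantic embedding."""
    def __init__(self, visual_dim=2048, desc_dim=300, embed_dim=256, num_classes=10):
        super().__init__()
        self.visual_branch = nn.Sequential(nn.Linear(visual_dim, 512), nn.ReLU())
        self.desc_branch = nn.Sequential(nn.Linear(desc_dim, 512), nn.ReLU())
        # Joint layer: concatenate the two "senses", then embed.
        self.fusion = nn.Sequential(nn.Linear(1024, embed_dim), nn.ReLU())
        # Shared ground-truth labels supervise the common semantic space.
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, visual_feat, desc_feat):
        v = self.visual_branch(visual_feat)
        d = self.desc_branch(desc_feat)
        z = self.fusion(torch.cat([v, d], dim=1))  # rich semantic embedding
        return z, self.classifier(z)

class TextNet(nn.Module):
    """Maps a topic distribution (e.g., from LDA) into the same space."""
    def __init__(self, num_topics=100, embed_dim=256, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_topics, embed_dim), nn.ReLU())
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, topic_dist):
        z = self.encoder(topic_dist)
        return z, self.classifier(z)

# Retrieval then ranks by cosine similarity between the two embeddings:
img_z, _ = MSFN()(torch.randn(4, 2048), torch.randn(4, 300))
txt_z, _ = TextNet()(torch.softmax(torch.randn(4, 100), dim=1))
scores = nn.functional.cosine_similarity(img_z.unsqueeze(1), txt_z.unsqueeze(0), dim=2)
print(scores.shape)  # (4, 4): the Img2Text / Text2Img score matrix

In such a setup, both encoders would typically be trained with a classification loss over the shared labels (plus, optionally, a ranking loss over matched image-text pairs), so that embeddings of the same semantic class land close together regardless of modality.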
