Multi-Networks Joint Learning for Large-Scale Cross-Modal Retrieval

This paper proposes a novel deep framework of multi-networks joint learning for large-scale cross-modal retrieval. Most existing cross-modal methods ignore memory requirements during training and testing, and are therefore generally restricted to small-scale data. Moreover, they treat feature learning and latent-space embedding as two separate steps, which cannot generate features specific to the cross-modal task. To alleviate these problems, we first decompose the multiplication and inversion of the large matrices commonly involved in existing methods into operations on many sub-matrices. Each sub-matrix handles one image-sentence pair, for which we further design a novel sampling strategy that selects the most representative samples to construct the cross-modal ranking loss and the within-modal discriminant loss. In this way, the proposed model consumes less memory at each step and can therefore scale to large data. Furthermore, we apply the proposed discriminative ranking loss to unify two heterogeneous networks, a deep residual network for images and a long short-term memory network for sentences, into an end-to-end deep learning architecture. As a result, we simultaneously learn features adapted to the cross-modal task and a shared latent space for images and sentences. Extensive evaluations on two large-scale cross-modal datasets show that the proposed method brings substantial improvements over other state-of-the-art ranking methods.
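
To make the architecture concrete, the sketch below shows how such a model could be assembled in PyTorch: a ResNet branch for images, an LSTM branch for sentences, and a bi-directional triplet ranking loss computed one mini-batch at a time, so that no dataset-sized similarity matrix is ever materialized. This is a minimal illustration under our own assumptions, not the authors' implementation; the class and function names (`ImageEncoder`, `SentenceEncoder`, `bidirectional_ranking_loss`) and the margin value are hypothetical, and the within-modal discriminant loss and the representative-sample selection strategy are omitted.

```python
# Illustrative sketch only (not the authors' code): a bi-directional
# triplet ranking loss over a shared latent space, with a ResNet image
# branch and an LSTM sentence branch. All class/function names are
# hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class ImageEncoder(nn.Module):
    """ResNet backbone followed by a linear projection into the shared space."""
    def __init__(self, embed_dim=512):
        super().__init__()
        resnet = models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop final fc
        self.proj = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                       # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)     # (B, 2048)
        return F.normalize(self.proj(feats), dim=1)  # unit-norm embeddings

class SentenceEncoder(nn.Module):
    """LSTM over word embeddings; final hidden state projected into the shared space."""
    def __init__(self, vocab_size, word_dim=300, hidden_dim=512, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, tokens):                       # tokens: (B, T) word indices
        _, (h, _) = self.lstm(self.embed(tokens))
        return F.normalize(self.proj(h[-1]), dim=1)

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge ranking loss in both retrieval directions using in-batch negatives.
    Working per mini-batch keeps memory small, in the spirit of the paper's
    decomposition of large-matrix operations into sub-matrices."""
    scores = img_emb @ txt_emb.t()                   # (B, B) similarity matrix
    pos = scores.diag().unsqueeze(1)                 # matched pairs on the diagonal
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    cost_t2i = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()
```

Replacing the `.mean()` reductions with a per-row maximum would focus the loss on the hardest in-batch negative, one common way to approximate selecting the most representative samples.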
