Asymmetrically Weighted CCA And Hierarchical Kernel Sentence Embedding For Image & Text Retrieval

Joint modeling of language and vision has been drawing increasing interest. A key aspect is a multimodal data representation that allows bidirectional retrieval: images given a query sentence and sentences given a query image. In this paper we present three contributions to canonical correlation analysis (CCA) based multimodal retrieval. First, we show that an asymmetric weighting of the canonical weights, which realizes a cross-view mapping from the search space to the query space, improves retrieval performance. Second, we devise a computationally efficient model selection, crucial for generalization and stability, within the framework of the Björck-Golub algorithm for regularized CCA via spectral filtering. Finally, we introduce a Hierarchical Kernel Sentence Embedding (HKSE) that approximates Kernel CCA for a particular similarity kernel between distributions of words embedded in a vector space. State-of-the-art results are obtained on the MSCOCO and Flickr benchmarks when these three techniques are used in conjunction.
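
The abstract does not spell out the retrieval pipeline, so the following is only a minimal sketch, assuming Python with NumPy: CCA is computed via the QR-plus-SVD route of the Björck-Golub algorithm, and the canonical dimensions are then re-weighted by powers of the canonical correlations, with a different (hypothetical) exponent per view as one possible reading of "asymmetric weighting". Feature dimensions, exponents, and the cosine-similarity ranking are illustrative assumptions, not the paper's settings.

    # Sketch only: toy features stand in for CNN image descriptors and
    # pooled word-vector sentence descriptors; exponents are hypothetical.
    import numpy as np

    def cca_bjorck_golub(X, Y, k=32):
        """Canonical weights for views X (n x dx) and Y (n x dy) via QR + SVD."""
        Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
        Qx, Rx = np.linalg.qr(Xc)              # thin QR of each centered view
        Qy, Ry = np.linalg.qr(Yc)
        U, s, Vt = np.linalg.svd(Qx.T @ Qy)    # singular values = canonical correlations
        Wx = np.linalg.solve(Rx, U[:, :k])     # canonical weights, image view
        Wy = np.linalg.solve(Ry, Vt.T[:, :k])  # canonical weights, sentence view
        return Wx, Wy, s[:k]

    def embed(F, W, corr, power):
        """Project features and scale each canonical dimension by corr**power."""
        Z = (F - F.mean(axis=0)) @ W * corr**power
        return Z / np.linalg.norm(Z, axis=1, keepdims=True)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 128))            # image features (placeholder)
    Y = rng.normal(size=(500, 100))            # sentence features (placeholder)

    Wx, Wy, corr = cca_bjorck_golub(X, Y)
    # Asymmetric weighting: a stronger exponent on the search (image) side
    # than on the query (sentence) side.
    img_emb = embed(X, Wx, corr, power=4.0)
    txt_emb = embed(Y, Wy, corr, power=1.0)
    scores = txt_emb @ img_emb.T               # sentence-to-image cosine scores
    ranks = np.argsort(-scores, axis=1)        # ranked image indices per sentence

In this route, the spectral-filtering regularization mentioned in the abstract would filter the spectrum (for example, a truncated-SVD-style cutoff) before the canonical weights are formed; the sketch omits that step for brevity.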
