Paper information - Cross-media Relevance Computation for Multimedia Retrieval

Cross-media Relevance Computation for Multimedia Retrieval

In this paper, we summarize our work on cross-media retrieval, where the queries and the retrieved content are of different media types. We study cross-media retrieval in the context of two popular multimedia retrieval applications: image retrieval by textual queries and sentence retrieval by visual queries. For image retrieval by textual queries, we propose text2image, which converts the computation of cross-media relevance between images and textual queries into a comparison of visual similarity among images. We also propose cross-media relevance fusion, a conceptual framework that combines multiple cross-media relevance estimators. These two techniques resulted in a winning entry in the Microsoft Image Retrieval Challenge at ACM MM 2015. For sentence retrieval by visual queries, we propose to compute cross-media relevance exclusively in a visual space. We contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. With the proposed Word2VisualVec model, we won the Video to Text Description task at TRECVID 2016.
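The idea of combining multiple cross-media relevance estimators can be illustrated with a minimal late-fusion sketch. The abstract does not specify how the estimators are combined, so the min-max normalization, the weighted average, and every name below (`min_max_normalize`, `fuse_relevance`, `rank_images`, the image ids and scores) are assumptions for illustration only, not the method used in the paper.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one estimator's scores to [0, 1] so estimators are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: this estimator carries no ranking information.
        return {image_id: 0.0 for image_id in scores}
    return {image_id: (s - lo) / (hi - lo) for image_id, s in scores.items()}


def fuse_relevance(
    estimates: List[Dict[str, float]], weights: List[float]
) -> Dict[str, float]:
    """Weighted average of normalized relevance scores (hypothetical fusion rule)."""
    if len(estimates) != len(weights):
        raise ValueError("need exactly one weight per estimator")
    total_weight = sum(weights)
    normalized = [min_max_normalize(e) for e in estimates]
    return {
        image_id: sum(w * n[image_id] for w, n in zip(weights, normalized))
        / total_weight
        for image_id in normalized[0]
    }


def rank_images(fused: Dict[str, float]) -> List[str]:
    """Return image ids sorted by fused relevance, most relevant first."""
    return sorted(fused, key=fused.get, reverse=True)


# Made-up scores from two hypothetical estimators for one textual query.
estimator_a = {"img1": 0.9, "img2": 0.1, "img3": 0.5}
estimator_b = {"img1": 2.0, "img2": 8.0, "img3": 6.0}

fused = fuse_relevance([estimator_a, estimator_b], weights=[0.5, 0.5])
ranking = rank_images(fused)
```

Normalizing before averaging keeps an estimator with a large score range from dominating the result; in practice the weights would presumably be tuned on held-out relevance judgments.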

Jianfeng Dong

[1] Hongxun Yao, et al. Learning Cross Space Mapping via DNN Using Large Scale Click-Through Logs, 2015, IEEE Transactions on Multimedia.

[2] Xirong Li, et al. University of Amsterdam and Renmin University at TRECVID 2016: Searching Video, Detecting Events and Describing Video, 2016, TRECVID.

[3] Björn W. Schuller, et al. Recent developments in openSMILE, the Munich open-source multimedia feature extractor, 2013, ACM Multimedia.

[4] Svetlana Lazebnik, et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5] Cees Snoek, et al. VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events, 2014, ACM Multimedia.

[6] Peter Young, et al. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013, J. Artif. Intell. Res.

[7] Xirong Li, et al. Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction, 2016.

[8] Yanjun Qi, et al. Polynomial Semantic Indexing, 2009, NIPS.

[9] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Samy Bengio, et al. Zero-Shot Learning by Convex Combination of Semantic Embeddings, 2013, ICLR.

[11] Jongwook Choi, et al. End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Yanjun Qi, et al. Supervised semantic indexing, 2009, ECIR.

[13] Xiaoyong Du, et al. Image Retrieval by Cross-Media Relevance Fusion, 2015, ACM Multimedia.

[14] Chong-Wah Ngo, et al. Click-through-based Subspace Learning for Image Search, 2014, ACM Multimedia.

[15] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[16] Xirong Li, et al. Early Embedding and Late Reranking for Video Captioning, 2016, ACM Multimedia.

[17] Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.

[18] Chong-Wah Ngo, et al. Image search by graph-based label propagation with image representation from DNN, 2013, MM '13.

[19] Lior Wolf, et al. Associating neural word embeddings with deep image representations using Fisher Vectors, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Cees Snoek, et al. Image2Emoji: Zero-shot Emoji Prediction for Visual Media, 2015, ACM Multimedia.

[21] Lior Wolf, et al. RNN Fisher Vectors for Action Recognition and Image Annotation, 2015, ECCV.

[22] W. Bruce Croft, et al. Linear feature-based models for information retrieval, 2007, Information Retrieval.

[23] Ruslan Salakhutdinov, et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014, ArXiv.

[24] Yin Li, et al. Learning Deep Structure-Preserving Image-Text Embeddings, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Xiaoyong Du, et al. Zero-shot Image Tagging by Hierarchical Semantic Embedding, 2015, SIGIR.

[26] Lin Ma, et al. Multimodal Convolutional Neural Networks for Matching Image and Sentence, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[27] Chong-Wah Ngo, et al. Click-through-based cross-view learning for image search, 2014, SIGIR.

[28] Marc'Aurelio Ranzato, et al. DeViSE: A Deep Visual-Semantic Embedding Model, 2013, NIPS.

[29] Yueting Zhuang, et al. Learning of Multimodal Representations With Random Walks on the Click Graph, 2016, IEEE Transactions on Image Processing.

[30] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[31] Samy Bengio, et al. Show and tell: A neural image caption generator, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Wei-Ying Ma, et al. Bag-of-Words Based Deep Neural Network for Image Retrieval, 2014, ACM Multimedia.

[33] Jing Wang, et al. Clickage: towards bridging semantic and intent gaps via mining click logs of search engines, 2013, ACM Multimedia.

[34] Roger Levy, et al. On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval, 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35] Xirong Li, et al. Cross-Media Similarity Evaluation for Web Image Retrieval in the Wild, 2017, IEEE Transactions on Multimedia.

[36] Jonathan G. Fiscus, et al. TRECVID 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking, 2016, TRECVID.

[37] Wei Liu, et al. Discriminative Dictionary Learning With Common Label Alignment for Cross-Modal Retrieval, 2016, IEEE Transactions on Multimedia.

[38] Chong-Wah Ngo, et al. Learning Query and Image Similarities with Ranking Canonical Correlation Analysis, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).