An output aggregation system for large-scale cross-modal retrieval

This paper presents our solution to the MSR-Bing Image Retrieval Challenge, which measures the relevance between web images and queries given in text form. We compare and integrate three representative methods (SVM-based, CCA-based, and PAMIR) to perform large-scale cross-modal retrieval with concept-level visual features. In the SVM-based approach, the relevance of an image to a query is scored by an SVM classifier trained online for that query. With canonical correlation analysis (CCA), the correlations between images and queries (i.e., text) are maximized by learning a pair of linear transformations. PAMIR [1] formalizes retrieval as a ranking problem and introduces a learning procedure that optimizes a ranking-related criterion by projecting images into the text space. Using concept-level visual features obtained with a convolutional neural network (CNN), our output aggregation system achieves NDCG@25 scores of 50.93% and 51.23% on the development and test data, respectively.

[1] Koby Crammer et al. Online Passive-Aggressive Algorithms, 2003, J. Mach. Learn. Res.

[2] David A. Forsyth et al. Matching Words and Pictures, 2003, J. Mach. Learn. Res.

[3] Songcan Chen et al. Locality preserving CCA with applications to data visualization and pose estimation, 2007, Image Vis. Comput.

[4] Samy Bengio et al. A Discriminative Kernel-Based Approach to Rank Images from Text Queries, 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] Wei-Ying Ma et al. Annotating Images by Mining Image Search Results, 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6] Daoqiang Zhang et al. A New Canonical Correlation Analysis Algorithm with Local Discrimination, 2010, Neural Processing Letters.

[7] Tat-Seng Chua et al. NUS-WIDE: a real-world web image database from National University of Singapore, 2009, CIVR '09.

[8] Roger Levy et al. A new approach to cross-modal multimedia retrieval, 2010, ACM Multimedia.

[9] Vicente Ordonez et al. Im2Text: Describing Images Using 1 Million Captioned Photographs, 2011, NIPS.

[10] Ivor W. Tsang et al. Textual Query of Personal Photos Facilitated by Large-Scale Web Data, 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11] Geoffrey E. Hinton et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.