An output aggregation system for large-scale cross-modal retrieval

This paper presents our solution to the MSR-Bing Image Retrieval Challenge, which measures the relevance between web images and queries given in text form. We compare and integrate three representative methods (SVM-based, CCA-based, and PAMIR) to perform large-scale cross-modal retrieval with concept-level visual features. In the SVM-based approach, the relevance of an image to a query is scored by an SVM classifier trained online for that query. With canonical correlation analysis (CCA), the correlations between images and queries (i.e., text) are maximized by learning a pair of linear transformations. PAMIR [1] formalizes retrieval as a ranking problem and introduces a learning procedure that optimizes a ranking-related criterion by projecting images into the text space. Using concept-level visual features obtained with a convolutional neural network (CNN), our output aggregation system achieves NDCG@25 scores of 50.93% and 51.23% on the development and test data, respectively.

[1] Koby Crammer et al. Online Passive-Aggressive Algorithms, 2003, J. Mach. Learn. Res.

[2] David A. Forsyth et al. Matching Words and Pictures, 2003, J. Mach. Learn. Res.

[3] Songcan Chen et al. Locality preserving CCA with applications to data visualization and pose estimation, 2007, Image Vis. Comput.

[4] Samy Bengio et al. A Discriminative Kernel-Based Approach to Rank Images from Text Queries, 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] Wei-Ying Ma et al. Annotating Images by Mining Image Search Results, 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6] Daoqiang Zhang et al. A New Canonical Correlation Analysis Algorithm with Local Discrimination, 2010, Neural Processing Letters.

[7] Tat-Seng Chua et al. NUS-WIDE: a real-world web image database from National University of Singapore, 2009, CIVR '09.

[8] Roger Levy et al. A new approach to cross-modal multimedia retrieval, 2010, ACM Multimedia.

[9] Vicente Ordonez et al. Im2Text: Describing Images Using 1 Million Captioned Photographs, 2011, NIPS.

[10] Ivor W. Tsang et al. Textual Query of Personal Photos Facilitated by Large-Scale Web Data, 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11] Geoffrey E. Hinton et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.