Learning Deep Features For MSR-bing Information Retrieval Challenge

Two tasks have been put forward in the MSR-bing Grand Challenge 2015. To address the information retrieval task, we raise and integrate a series of methods with visual features obtained by convolution neural network (CNN) models. In our experiments, we discover that the ranking strategies of Hierarchical clustering and PageRank methods are mutually complementary. Another task is fine-grained classification. In contrast to basic-level recognition, fine-grained classification aims to distinguish between different breeds or species or product models, and often requires distinctions that must be conditioned on the object pose for reliable identification. Current state-of-the-art techniques rely heavily upon the use of part annotations, while the bing datasets suffer both abundance of part annotations and dirty background. In this paper, we propose a CNN-based feature representation for visual recognition only using image-level information. Our CNN model is pre-trained on a collection of clean datasets and fine-tuned on the bing datasets. Furthermore, a multi-scale training strategy is adopted by simply resizing the input images into different scales and then merging the soft-max posteriors. We then implement our method into a unified visual recognition system on Microsoft cloud service. Finally, our solution achieved top performance in both tasks of the contest

[1]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2]  David W. Jacobs,et al.  Dog Breed Classification Using Part Localization , 2012, ECCV.

[3]  Fei Su,et al.  An output aggregation system for large scale cross-modal retrieval , 2014, 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW).

[4]  Larry S. Davis,et al.  Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance , 2011, 2011 International Conference on Computer Vision.

[5]  Yan-Ying Chen,et al.  Search-based relevance association with auxiliary contextual cues , 2013, MM '13.

[6]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[7]  Jian Wang,et al.  Cross Modal Deep Model and Gaussian Process Based Model for MSR-Bing Challenge , 2014, ACM Multimedia.

[8]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[9]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[10]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[11]  Yi Yang,et al.  Cross-media relevance mining for evaluating text-based image search engine , 2014, 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW).

[12]  Hanqing Lu,et al.  Online sketching hashing , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).