Cross-modal Retrieval based on Big Transfer and Regional Maximum Activation of Convolutions with Generalized Attention

Image-text retrieval remains a challenging topic because image features still cannot adequately represent high-level semantic information, even though representation ability has improved thanks to advances in deep learning. This paper proposes a cross-modal image-text retrieval framework (BiTGRMAC) based on Big Transfer and regional maximum activation of convolutions with generalized attention. Big Transfer (BiT), pre-trained on a large amount of data, is used to extract image features and is fine-tuned on the cross-modal image datasets. In addition, a new generalized attention regional maximum activation of convolutions (GRMAC) descriptor is introduced into BiT; it generates image features through an attention mechanism, which reduces the influence of background clutter and highlights the target. For text, the widely used Sentence CNN is adopted to extract text features. The parameters of the image and text deep models are learned by minimizing a cross-modal loss function in an end-to-end framework. Experimental results on three widely used datasets show that the proposed method effectively improves retrieval accuracy.
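
The abstract describes GRMAC only at a high level. The following is a minimal PyTorch sketch of attention-weighted regional max pooling in the spirit of that descriptor; the channel-sum attention mask with exponent p and the simplified l x l region grid are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a GRMAC-style descriptor: regional max pooling of CNN activations
# weighted by a generalized attention mask. Assumptions: the mask is the
# channel-wise sum of activations raised to a power p, and regions form a
# simplified l x l grid per scale (the standard RMAC grid is more elaborate).
import torch
import torch.nn.functional as F


def attention_mask(x, p=2.0):
    # x: (C, H, W) feature map from the image backbone (e.g. BiT).
    # Generalized attention: channel-wise sum of activations raised to power p,
    # normalized to [0, 1]; larger p sharpens the mask toward salient regions.
    a = x.clamp(min=0).sum(dim=0) ** p               # (H, W)
    return a / (a.max() + 1e-8)


def grmac_descriptor(x, levels=3, p=2.0):
    # x: (C, H, W). Returns an L2-normalized global image descriptor of size C.
    C, H, W = x.shape
    xa = x * attention_mask(x, p).unsqueeze(0)       # attention-weighted activations
    desc = torch.zeros(C)
    for l in range(1, levels + 1):
        hs, ws = max(H // l, 1), max(W // l, 1)      # window size at this scale
        for i in range(l):
            for j in range(l):
                region = xa[:, i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
                if region.numel() == 0:
                    continue
                v = region.amax(dim=(1, 2))          # max pooling inside the region
                desc += F.normalize(v, dim=0)        # L2-normalize, then sum over regions
    return F.normalize(desc, dim=0)                  # final L2 normalization


# Example: a 2048-channel 7x7 feature map, as a ResNet-style backbone would produce.
feat = torch.relu(torch.randn(2048, 7, 7))
print(grmac_descriptor(feat).shape)                  # torch.Size([2048])
```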

[1] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[2] Jeff A. Bilmes, et al. Deep Canonical Correlation Analysis, 2013, ICML.

[3] Roger Levy, et al. On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval, 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[5] Dezhong Peng, et al. Deep Supervised Cross-Modal Retrieval, 2019, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Ronan Sicre, et al. Particular object retrieval with integral max-pooling of CNN activations, 2015, ICLR.

[7] Cyrus Rashtchian, et al. Collecting Image Annotations Using Amazon's Mechanical Turk, 2010, Mturk@HLT-NAACL.

[8] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[9] Zhenguo Yang, et al. Regional Maximum Activations of Convolutions with Attention for Cross-domain Beauty and Personal Care Product Retrieval, 2018, ACM Multimedia.

[10] Yao Zhao, et al. Cross-Modal Retrieval With CNN Visual Features: A New Baseline, 2017, IEEE Transactions on Cybernetics.

[11] Armand Joulin, et al. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, 2014, NIPS.

[12] Hu Tian, et al. Cross-modal correlation learning with deep convolutional architecture, 2015, IEEE Visual Communications and Image Processing (VCIP).

[13] Lucas Beyer, et al. Big Transfer (BiT): General Visual Representation Learning, 2020, ECCV.

[14] Krystian Mikolajczyk, et al. Deep correlation for matching images and text, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Mark J. Huiskes, et al. The MIR Flickr retrieval evaluation, 2008, MIR '08.

[16] Yueting Zhuang, et al. Deep Compositional Cross-modal Learning to Rank via Local-Global Alignment, 2015, ACM Multimedia.

[17] Yan Hua, et al. Uniting Image and Text Deep Networks via Bi-directional Triplet Loss for Retrieval, 2019, IEEE 9th International Conference on Electronics Information and Emergency Communication (ICEIEC).

[18] Lingyun Yu, et al. Beauty Product Retrieval Based on Regional Maximum Activation of Convolutions with Generalized Attention, 2019, ACM Multimedia.

[19] Wei Wang, et al. Effective deep learning-based multi-modal retrieval, 2016, The VLDB Journal.