Deep Pairwise Ranking with Multi-label Information for Cross-Modal Retrieval

Cross-modal retrieval (e.g., image-to-text or text-to-image retrieval) has gained much attention in recent years due to the growing demand for retrieval over large-scale multi-modal data. To alleviate the problem that irrelevant information between images and texts is often ignored, this paper proposes a Deep Pairwise Ranking model with multi-label information for Cross-Modal retrieval (DPRCM). DPRCM directly learns a mapping from images and texts to a compact Euclidean space in which distances correspond to the similarity between images and texts. The bi-triplet loss function in DPRCM reduces the distance between associated images and texts in the common subspace and enlarges the margin between unrelated samples. The classification loss function makes better use of the multi-label information to reduce the semantic gap between image features and text descriptions. Experiments on three widely used datasets show that DPRCM achieves competitive performance compared with state-of-the-art methods.
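The two loss terms described above can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption: the hinge form with squared Euclidean distance (in the style of FaceNet-like triplet losses), the sigmoid cross-entropy used for the multi-label term, and the names `bi_triplet_loss`, `multilabel_loss`, and `margin` are hypothetical and not taken from the paper.

```python
import numpy as np

def sq_dist(a, b):
    """Squared Euclidean distance between matching rows of a and b."""
    return np.sum((a - b) ** 2, axis=1)

def bi_triplet_loss(img, txt_pos, txt_neg, txt, img_pos, img_neg, margin=1.0):
    """Assumed bidirectional triplet loss in a common embedding space.

    Image-to-text direction: anchor image, matching text, non-matching text.
    Text-to-image direction: anchor text, matching image, non-matching image.
    All inputs are arrays of shape (batch, dim).
    """
    # Hinge: a matched pair should be closer than a mismatched pair by `margin`.
    i2t = np.maximum(0.0, sq_dist(img, txt_pos) - sq_dist(img, txt_neg) + margin)
    t2i = np.maximum(0.0, sq_dist(txt, img_pos) - sq_dist(txt, img_neg) + margin)
    return float(np.mean(i2t) + np.mean(t2i))

def multilabel_loss(logits, labels):
    """Assumed multi-label classification loss: sigmoid cross-entropy.

    logits and labels have shape (batch, num_labels); labels are 0 or 1.
    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    per_entry = (np.maximum(logits, 0.0) - logits * labels
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(np.mean(per_entry))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = lambda: rng.normal(size=(4, 8))  # toy embeddings, batch=4, dim=8
    ranking = bi_triplet_loss(emb(), emb(), emb(), emb(), emb(), emb())
    labels = rng.integers(0, 2, size=(4, 5)).astype(float)
    classification = multilabel_loss(rng.normal(size=(4, 5)), labels)
    print("ranking:", ranking, "classification:", classification)
```

In a full model the two terms would presumably be combined into one training objective, for example as a weighted sum, but the weighting scheme is not stated in the abstract.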
