Deep Semantic Space with Intra-class Low-rank Constraint for Cross-modal Retrieval

In this paper, we propose a novel Deep Semantic Space learning model with an Intra-class Low-rank constraint (DSSIL) for cross-modal retrieval. DSSIL is composed of two subnetworks for modality-specific representation learning, followed by projection layers that map both modalities into a common space. In particular, DSSIL exploits semantic consistency to fuse the cross-modal data in a high-level common space, and constrains the matrix of common representations within each class to be low-rank, making the intra-class representations more correlated. More formally, two regularization terms are devised for these two aspects and incorporated into the objective of DSSIL. To optimize the modality-specific subnetworks and the projection layers simultaneously by gradient descent, we approximate the nonconvex low-rank constraint by minimizing the few smallest singular values of the intra-class matrix, supported by theoretical analysis. Extensive experiments on three public datasets demonstrate the superiority of DSSIL over state-of-the-art methods for cross-modal retrieval.
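The low-rank surrogate described above can be illustrated with a minimal NumPy sketch: instead of penalizing the rank of a class's representation matrix directly (which is nonconvex and non-differentiable), one penalizes the sum of its smallest singular values beyond a target rank. The function name, shapes, and target rank below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def low_rank_penalty(Z, target_rank):
    """Surrogate for a nonconvex low-rank constraint: the sum of the
    singular values of Z beyond the first `target_rank` of them.
    Z: (n_samples, dim) common representations of one class."""
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, descending
    return float(np.sum(s[target_rank:]))   # penalize the tail

# Toy usage: an exactly rank-1 class matrix incurs (near-)zero penalty,
# while a random full-rank matrix of the same size does not.
rng = np.random.default_rng(0)
Z_low = rng.standard_normal((6, 1)) @ rng.standard_normal((1, 8))
Z_full = rng.standard_normal((6, 8))
print(low_rank_penalty(Z_low, 1))   # ≈ 0 (up to floating-point error)
print(low_rank_penalty(Z_full, 1))  # clearly positive
```

The penalty is zero exactly when rank(Z) ≤ target_rank, which is why driving it toward zero encourages intra-class representations to lie in a common low-dimensional subspace.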
