Cross-modal subspace learning via kernel correlation maximization and discriminative structure-preserving

Measuring the distance between heterogeneous data remains an open problem. Many works learn a common subspace in which the similarity between different modalities can be computed directly. However, most existing methods focus on learning a latent subspace without preserving the semantically structural information, and therefore fail to achieve the desired results. In this paper, we propose a novel framework, termed Cross-modal subspace learning via Kernel correlation maximization and Discriminative structure-preserving (CKD), which addresses this problem in two respects. First, we construct a shared semantic graph so that the data of each modality preserve their semantic neighborhood relationships. Second, we introduce the Hilbert-Schmidt Independence Criterion (HSIC) to enforce consistency between the feature similarity and the semantic similarity of samples. Our model not only captures the inter-modality correlation by maximizing the kernel correlation but also preserves the semantically structural information within each modality. Extensive experiments on three public datasets demonstrate that the proposed CKD is competitive with classic subspace-learning methods.
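The HSIC term described above can be sketched as follows. This is a minimal NumPy illustration of the empirical HSIC between a feature kernel and a label-based semantic graph, not the authors' implementation; the toy data, the linear kernel, and the label-indicator affinity matrix are assumptions made for the example.

```python
import numpy as np

def hsic(K, L):
    # Empirical HSIC between two n x n kernel matrices:
    # HSIC(K, L) = trace(K H L H) / (n - 1)^2,
    # where H = I - (1/n) 11^T is the centering matrix.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Toy data: 4 samples, 3-dim features, two semantic classes.
X = np.array([[0.0, 0.1, 0.2],
              [0.1, 0.2, 0.3],
              [5.0, 5.1, 5.2],
              [5.1, 5.2, 5.3]])
labels = np.array([0, 0, 1, 1])

# Shared semantic graph: S_ij = 1 when samples i and j share a label.
S = (labels[:, None] == labels[None, :]).astype(float)

K = X @ X.T          # linear kernel on the features
score = hsic(K, S)   # dependence between feature- and semantic-similarity
```

Maximizing this score over a projection of the features pushes the feature-similarity structure toward the label-induced semantic structure, which is the role the HSIC term plays in the abstract.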
