Subspace learning by kernel dependence maximization for cross-modal retrieval

Abstract The heterogeneity of multi-modal data is the key challenge in multimedia cross-modal retrieval, and many approaches have been developed to address it. Subspace learning approaches, the mainstream, learn a latent shared subspace in which similarities between cross-modal data can be measured, and have shown remarkable performance in practical cross-modal retrieval tasks. However, most existing approaches essentially perform feature dimension reduction on the different modalities within a shared subspace and thus cannot fundamentally resolve the heterogeneity issue; as a result, they often fail to obtain the expected results. Hilbert space theory states that Hilbert spaces of the same dimension are isomorphic. On this premise, isomorphic mapping subspaces can be regarded as a single space shared by multi-modal data. To this end, we propose a correlation-based cross-modal subspace learning model via kernel dependence maximization (KDM). Unlike most existing correlation-based subspace learning methods, KDM learns a subspace representation for each modality by maximizing the kernel dependence (correlation) between modalities instead of directly maximizing their feature correlations. Specifically, we first map each modality into its own Hilbert space of the same dimension, then compute a kernel matrix in each Hilbert space and measure the correlations between modalities from these kernels. Experimental results show the effectiveness and competitiveness of the proposed KDM against classic subspace learning approaches.
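The kernel-based dependence measure described above can be illustrated with the empirical Hilbert-Schmidt Independence Criterion (HSIC), the standard estimator for kernel dependence between two sets of paired samples. The sketch below is illustrative only, not the authors' implementation: the RBF kernel choice, the `gamma` value, and the normalization are assumptions for demonstration.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix for samples stacked as rows of X (assumed kernel choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    return np.exp(-gamma * d2)

def hsic(Kx, Ky):
    """Empirical HSIC between two modalities given their kernel matrices.

    HSIC(Kx, Ky) = trace(Kx H Ky H) / (n - 1)^2, where H centers the kernels.
    Larger values indicate stronger (kernel) dependence between the modalities.
    """
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(Kx @ H @ Ky @ H) / (n - 1) ** 2
```

In a KDM-style setting, each modality would first be projected into a subspace of the same dimension, the kernel matrices computed on the projected samples, and the projections optimized so that this dependence score is maximized across modalities.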
