Towards Improving Canonical Correlation Analysis for Cross-modal Retrieval

Building correlations for cross-modal retrieval, i.e., image-to-text and text-to-image retrieval, is a feasible way to bridge the semantic gap between different modalities. Canonical correlation analysis (CCA) based methods have achieved great success. However, conventional 2-view CCA suffers from three inherent problems: 1) it fails to capture intra-modal semantic consistency, which is necessary for improving retrieval performance; 2) it struggles to learn non-linear correlations between modalities; and 3) its similarity measure is problematic, because the latent space learned by CCA is not directly optimized for any particular distance measure. To address these problems, this paper proposes an improved CCA algorithm (ICCA) with three contributions. First, we propose two effective semantic features based on text features to improve intra-modal semantic consistency. Second, we extend traditional CCA from 2 views to 4 views and embed the 4-view CCA into a progressive framework to alleviate over-fitting. The progressive framework combines the training of linear projections and non-linear hidden layers, ensuring that good representations of the raw input data are learned at the output of the network. Third, inspired by large-scale similarity learning (LSSL), we propose a similarity metric that improves the distance measure. Experiments on three public data sets demonstrate the effectiveness of the proposed ICCA method.
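As a point of reference for the 2-view baseline the abstract criticizes, the following is a minimal numpy sketch of classic CCA: it whitens each view, takes an SVD of the whitened cross-covariance, and returns linear projections into a shared latent space. The toy data, function names, and regularization constant are illustrative assumptions, not part of the paper's method.

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Classic 2-view CCA via SVD of the whitened cross-covariance.

    X: (n, dx) view-1 features; Y: (n, dy) view-2 features.
    Returns projection matrices Wx (dx, k) and Wy (dy, k) mapping
    each view into a shared k-dimensional latent space.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance matrices (reg avoids singularities).
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Symmetric inverse square root via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt[:k].T

# Toy demo: two "modalities" generated from a common latent signal.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))                          # shared latent factors
X = z @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))
Y = z @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))
Wx, Wy = cca(X, Y, k=2)
u = (X - X.mean(0)) @ Wx
v = (Y - Y.mean(0)) @ Wy
corr = np.corrcoef(u[:, 0], v[:, 0])[0, 1]             # first canonical correlation
```

Note that nothing in this objective involves class labels, non-linear mappings, or a retrieval distance, which is exactly the gap the three ICCA components (semantic text features, progressive 4-view CCA, and the LSSL-inspired similarity metric) are designed to close.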

[1] Jeff A. Bilmes, et al. Deep Canonical Correlation Analysis. ICML, 2013.

[2] Fei Su, et al. Deep canonical correlation analysis with progressive and hypergraph learning for cross-modal retrieval. Neurocomputing, 2016.

[3] Michael Isard, et al. A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics. International Journal of Computer Vision, 2012.

[4] Kai Liu, et al. Datum-Adaptive Local Metric Learning for Person Re-identification. IEEE Signal Processing Letters, 2015.

[5] Geoffrey E. Hinton, et al. Exponential Family Harmoniums with an Application to Information Retrieval. NIPS, 2004.

[6] Tat-Seng Chua, et al. NUS-WIDE: a real-world web image database from National University of Singapore. CIVR, 2009.

[7] Pascal Vincent, et al. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[8] Geoffrey E. Hinton, et al. Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008.

[9] Cyrus Rashtchian, et al. Every Picture Tells a Story: Generating Sentences from Images. ECCV, 2010.

[10] Ruifan Li, et al. Cross-modal Retrieval with Correspondence Autoencoder. ACM Multimedia, 2014.

[11] Shengcai Liao, et al. Large Scale Similarity Learning Using Similar Pairs for Person Verification. AAAI, 2016.

[12] Fei Su, et al. Efficient multi-modal hypergraph learning for social image classification with complex label correlations. Neurocomputing, 2016.

[13] Horst Bischof, et al. Large scale metric learning from equivalence constraints. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[14] Roger Levy, et al. On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

[15] Geoffrey E. Hinton, et al. Replicated Softmax: an Undirected Topic Model. NIPS, 2009.

[16] Nitish Srivastava, et al. Learning Representations for Multimodal Data with Deep Belief Nets. 2012.

[17] Juhan Nam, et al. Multimodal Deep Learning. ICML, 2011.

[18] Qingming Huang, et al. Cross-modal Retrieval by Real Label Partial Least Squares. ACM Multimedia, 2016.

[19] Xiaojie Wang, et al. Correspondence Autoencoders for Cross-Modal Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, 2015.

[20] Samy Bengio, et al. A Discriminative Kernel-Based Approach to Rank Images from Text Queries. 2008.

[21] D. K. Smith, et al. Numerical Optimization. Journal of the Operational Research Society, 2001.

[22] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. IEEE International Conference on Computer Vision (ICCV), 2015.

[23] Yao Zhao, et al. Cross-Modal Retrieval With CNN Visual Features: A New Baseline. IEEE Transactions on Cybernetics, 2017.

[24] Yoshua Bengio, et al. Extracting and composing robust features with denoising autoencoders. ICML, 2008.

[25] Roger Levy, et al. A new approach to cross-modal multimedia retrieval. ACM Multimedia, 2010.

[26] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[27] Kilian Q. Weinberger, et al. Distance Metric Learning for Large Margin Nearest Neighbor Classification. NIPS, 2005.