Deep canonical correlation analysis with progressive and hypergraph learning for cross-modal retrieval

This paper addresses the problem of modeling Internet images and their associated texts for cross-modal retrieval, covering both text-to-image and image-to-text retrieval. We start from deep canonical correlation analysis (DCCA), a deep approach that maps text and image pairs into a common latent space. We first propose a novel progressive framework and embed DCCA in it. In this framework, a linear projection loss layer is inserted before the nonlinear hidden layers of a deep network. The linear projection and the nonlinear layers are trained jointly, so that the linear projection is well matched to the nonlinear processing stages and the network output is a good representation of the raw input data. We then introduce into DCCA a hypergraph semantic embedding (HSE) method, which extracts latent semantics from texts, to regularize the latent space learned from the image view and the text view. In addition, we propose a search-based similarity measure to score the relevance of image-text pairs. Combining these ideas, we propose a model called DCCA-PHS for cross-modal retrieval. Experiments on three publicly available data sets show that DCCA-PHS is effective and efficient, and that it achieves state-of-the-art performance in the unsupervised scenario.
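As a rough illustration of the common latent space idea, the sketch below implements plain linear CCA in NumPy, which is the classical building block that DCCA extends with deep networks. It is not the DCCA-PHS model: the progressive framework, the HSE regularizer, and the search-based similarity measure are not reproduced here. The function names (`linear_cca`, `rank_by_cosine`), the regularization constant `reg`, and the use of cosine similarity for ranking are all assumptions made for illustration.

```python
import numpy as np


def _inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T


def linear_cca(X, Y, k, reg=1e-4):
    """Linear CCA for two views (rows are paired samples).

    Returns projections A (for X), B (for Y) and the top-k canonical
    correlations. The projections apply to mean-centered data.
    `reg` is a hypothetical ridge term added for numerical stability.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Sxx_is, Syy_is = _inv_sqrt(Sxx), _inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations; singular vectors give the projections.
    T = Sxx_is @ Sxy @ Syy_is
    U, s, Vt = np.linalg.svd(T)
    A = Sxx_is @ U[:, :k]
    B = Syy_is @ Vt.T[:, :k]
    return A, B, s[:k]


def rank_by_cosine(query, gallery):
    """Rank gallery rows by cosine similarity to a query vector.

    Cosine similarity is a stand-in here, not the paper's
    search-based similarity measure.
    """
    q = query / np.linalg.norm(query)
    G = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(G @ q))
```

In a retrieval setting, one would project centered image features with `A` and centered text features with `B`, then rank the items of one modality against a query from the other, for example with `rank_by_cosine`.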
