Learning Cross-Media Joint Representation With Sparse and Semisupervised Regularization

Cross-media retrieval has become a key problem in both research and application: users submit a query of any media type and retrieve results across all media types (text, image, audio, video, and 3-D). The key challenge is measuring content similarity among different media. Existing cross-media retrieval methods usually model either the pairwise correlation or the semantic information, but not both together. In fact, these two kinds of information are complementary, and optimizing them simultaneously can further improve accuracy. In this paper, we propose a novel feature learning algorithm for cross-media data, called joint representation learning (JRL), which jointly explores the correlation and semantic information in a unified optimization framework. JRL integrates sparse and semisupervised regularization for different media types into one unified optimization problem, whereas existing feature learning methods generally focus on a single media type. On the one hand, JRL learns sparse projection matrices for different media simultaneously, so that different media align with each other in a way that is robust to noise. On the other hand, JRL exploits both the labeled and the unlabeled data of different media types. Unlabeled examples of different media types increase the diversity of the training data and boost the performance of joint representation learning. Furthermore, JRL not only reduces the dimension of the original features but also incorporates the cross-media correlation into the final representation, which further improves the performance of both cross-media retrieval and single-media retrieval. Experiments on two datasets with up to five media types show the effectiveness of the proposed approach compared with state-of-the-art methods.
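
The combination of sparse and semisupervised regularization described above can be illustrated with a minimal sketch. It is not the JRL objective itself, which the abstract does not spell out: it assumes a simplified per-media problem, min_P ||X_l P - Y||_F^2 + lam ||P||_{2,1} + mu tr(P^T X^T L X P), where the l2,1 norm gives row sparsity and a k-nearest-neighbor graph Laplacian L over labeled and unlabeled samples gives the semisupervised term. The pairwise cross-media correlation term is omitted, and all names (`learn_projection`, `knn_laplacian`, `cosine_rank`, `lam`, `mu`, `k`) are hypothetical.

```python
import numpy as np

def l21_norm(P):
    # Sum of the Euclidean norms of the rows of P.
    return np.sqrt((P ** 2).sum(axis=1)).sum()

def knn_laplacian(X, k=5):
    # Unnormalized Laplacian of a symmetrized k-NN graph.
    # X has shape (n, d); rows are samples (labeled and unlabeled).
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:k + 1]  # skip position 0 (the sample itself)
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)
    return np.diag(W.sum(axis=1)) - W

def learn_projection(X, Y, labeled, lam=0.1, mu=0.1, k=5, iters=30, eps=1e-8):
    # Assumed objective (a simplification, not the paper's):
    #   ||X[labeled] P - Y||_F^2 + lam * ||P||_{2,1} + mu * tr(P^T X^T L X P)
    # solved by iteratively reweighted least squares.
    # X: (n, d) features of one media type, Y: (len(labeled), c) one-hot labels.
    n, d = X.shape
    L = knn_laplacian(X, k)
    Xl = X[labeled]
    A = Xl.T @ Xl + mu * X.T @ L @ X
    B = Xl.T @ Y
    D = np.eye(d)
    for _ in range(iters):
        P = np.linalg.solve(A + lam * D, B)
        row_norms = np.sqrt((P ** 2).sum(axis=1))
        # Reweighting matrix for the l2,1 term: D_ii = 1 / (2 * ||p_i||).
        D = np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
    return P

def cosine_rank(query, gallery):
    # Indices of gallery rows sorted by decreasing cosine similarity to query.
    q = query / (np.linalg.norm(query) + 1e-12)
    g = gallery / (np.linalg.norm(gallery, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(g @ q))
```

Under these assumptions, one projection would be learned per media type against the same label matrix, so that projected image and text features land in a shared space; a query such as `cosine_rank(x_img @ P_img, X_txt @ P_txt)` then ranks text items for an image query. In this sketch the shared label space alone aligns the media, which is a stand-in for the joint optimization the paper proposes.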
