Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback

Different types of multimedia data express high-level semantics from different perspectives. How to learn comprehensive high-level semantics from heterogeneous data and enable efficient cross-media retrieval has become a pressing research problem. Although abundant statistical and semantic correlations exist among heterogeneous low-level media content, exploiting them for effective cross-media querying remains challenging. In this paper, we propose a new cross-media retrieval method based on short-term and long-term relevance feedback. Our method focuses on two typical types of media data, namely image and audio. First, we build a multimodal representation via statistical canonical correlation between the image and audio feature matrices, and define a cross-media distance metric for similarity measurement; we then propose an optimization strategy based on relevance feedback that fuses short-term learning results and long-term accumulated knowledge into the objective function. Experiments on an image-audio dataset demonstrate the superiority of our method over several existing algorithms.
