A cross-media distance metric learning framework based on multi-view correlation mining and matching

With the explosion of multimedia data, data of different modalities often coexist in web repositories. It is therefore increasingly important to explore the underlying cross-media correlations, rather than single-modality distance measures, in order to improve multimedia semantics understanding. Cross-media distance metric learning focuses on measuring the correlation between multimedia data of different modalities. However, content heterogeneity and the semantic gap make measuring cross-media distance very challenging. In this paper, we propose a novel cross-media distance metric learning framework based on sparse feature selection and multi-view matching. First, we employ sparse feature selection to select a subset of relevant features and remove redundant ones from high-dimensional image and audio features. Second, we maximize the canonical correlation coefficient during image-audio feature dimension reduction to mine cross-media correlation. Third, we construct a Multi-modal Semantic Graph to find the cross-media correlation embedded in the manifold. Finally, we fuse the canonical correlation and the manifold information into multi-view matching, which harmonizes the different correlations through an iterative process, and build a Cross-media Semantic Space for cross-media distance measurement. Experiments on cross-media retrieval are conducted on an image-audio dataset. The results are encouraging and show that our approach is effective.
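As a rough illustration of the canonical-correlation step only, the following sketch applies plain linear CCA to synthetic stand-ins for image and audio features and measures distance in the shared projected space. It is a minimal sketch under stated assumptions, not the framework proposed here: the sparse feature selection, the Multi-modal Semantic Graph, and the iterative multi-view matching are all omitted, and the function names (`cca`, `cross_media_distance`), the regularization constant, and the synthetic data are hypothetical.

```python
import numpy as np


def cca(X, Y, n_components=2, reg=1e-6):
    """Plain linear CCA via whitening and SVD.

    X is (n, p) and Y is (n, q), with rows paired across the two views.
    Returns projection matrices Wx (p, k), Wy (q, k) and the top-k
    canonical correlations. `reg` is an assumed small ridge term that
    keeps the covariance matrices invertible.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return Wx, Wy, s[:n_components]


def cross_media_distance(x_img, y_aud, mean_x, mean_y, Wx, Wy):
    """Euclidean distance between an image and an audio clip after projection."""
    return float(np.linalg.norm((x_img - mean_x) @ Wx - (y_aud - mean_y) @ Wy))


# Synthetic stand-ins: both "views" are noisy linear maps of a shared 2-D latent.
rng = np.random.default_rng(0)
z = rng.standard_normal((200, 2))
X = z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((200, 5))  # "image" features
Y = z @ rng.standard_normal((2, 4)) + 0.1 * rng.standard_normal((200, 4))  # "audio" features

Wx, Wy, corrs = cca(X, Y, n_components=2)
mx, my = X.mean(axis=0), Y.mean(axis=0)

# Average distance for correctly paired items versus deliberately mismatched pairs.
matched = np.mean([cross_media_distance(X[i], Y[i], mx, my, Wx, Wy) for i in range(200)])
mismatched = np.mean([cross_media_distance(X[i], Y[(i + 1) % 200], mx, my, Wx, Wy) for i in range(200)])
print("canonical correlations:", corrs)
print("mean matched distance:", matched, "mean mismatched distance:", mismatched)
```

In this toy setting, paired items should land closer together in the projected space than mismatched ones, which is the property a cross-media distance measure relies on; real image and audio features would need the additional steps described above.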
