Heterogeneous multimedia cooperative annotation based on multimodal correlation learning

Abstract Rich multimedia content dominates the current Web. On popular social media platforms such as Facebook, Twitter, and Instagram, users create millions of multimedia items. Meanwhile, multimedia data comprises multiple modalities, such as text, images, videos, audio, and time-series sequences. Many research efforts have been devoted to improving multimedia annotation performance; however, the prevailing methods are designed for single-media annotation tasks. In fact, each heterogeneous media type describes a given label from the perspective of its own modality, and the modalities are complementary to one another, so it is critical to explore advanced techniques for heterogeneous data analysis and multimedia annotation. Inspired by this observation, this paper presents a new multimodal correlation learning method for heterogeneous multimedia cooperative annotation, named unified space learning, which projects heterogeneous media data into one unified space. We formulate the multimedia annotation task within a semi-supervised learning framework in which a different projection matrix is learned for each media type. In this way, the different media contents are aligned cooperatively and jointly provide a more comprehensive profile of the given semantic labels. Experimental results on a dataset containing images, audio clips, videos, and 3D models show that the proposed approach is effective.
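
To make the unified-space idea concrete, the sketch below illustrates one plausible reading of it: a separate linear projection is learned for each media type so that heterogeneous feature vectors land in a shared label space, and the per-modality scores are then fused for cooperative annotation. This is a minimal illustration under assumed feature dimensions, synthetic data, and a simple ridge-regression solver; it is not the authors' actual objective or implementation.

```python
# Minimal conceptual sketch (not the authors' implementation): learn one linear
# projection per modality so that heterogeneous features map into a shared
# label space, then fuse the per-modality scores for cooperative annotation.
# Dimensions, the ridge closed-form solution, and the fusion rule are all
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_labels = 10  # dimensionality of the shared (label) space
modal_dims = {"image": 512, "audio": 128, "video": 256, "3d": 64}  # assumed feature sizes

# Synthetic data: per-modality features for the same items, partially labeled.
n_labeled, n_unlabeled = 200, 300
X = {m: rng.standard_normal((n_labeled + n_unlabeled, d)) for m, d in modal_dims.items()}
Y = rng.integers(0, 2, size=(n_labeled, n_labels)).astype(float)  # multi-label targets


def learn_projection(X_m, Y, n_labeled, lam=1.0):
    """Ridge-regression projection W_m mapping modality-m features to the label space.

    Only the labeled block drives the data-fit term; lam keeps the solution
    stable. A fuller semi-supervised objective (e.g., graph regularization on
    the unlabeled items) would be added here in practice.
    """
    X_l = X_m[:n_labeled]  # labeled rows
    d = X_m.shape[1]
    # Closed-form minimizer of ||X_l W - Y||^2 + lam ||W||^2
    return np.linalg.solve(X_l.T @ X_l + lam * np.eye(d), X_l.T @ Y)


# One projection matrix per media type, as in the unified-space idea.
W = {m: learn_projection(X[m], Y, n_labeled) for m in X}

# Cooperative annotation: average per-modality scores in the unified space.
scores = np.mean([X[m] @ W[m] for m in X], axis=0)
predicted_labels = (scores > 0.5).astype(int)  # simple thresholding for illustration
print(predicted_labels.shape)  # (500, 10)
```

Averaging the projected scores is only one possible fusion choice; the point of the sketch is that, once all modalities live in the same space, their predictions for a label can be combined directly.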
