Paper Information - Multi-Modal Deep Analysis for Multimedia

Multi-Modal Deep Analysis for Multimedia

With the rapid development of Internet and multimedia services over the past decade, a huge amount of user-generated and service-provider-generated multimedia data has become available. These data are heterogeneous and multi-modal in nature, which poses great challenges for processing and analysis. Multi-modal data consist of a mixture of data types from different modalities, such as text, images, video, and audio. In this article, we present a deep and comprehensive overview of multi-modal analysis in multimedia. We introduce two scientific research problems: data-driven correlational representation and knowledge-guided fusion for multimedia analysis. We investigate these two problems from the following aspects: 1) multi-modal correlational representation: multi-modal fusion of data across different modalities, and 2) multi-modal data and knowledge fusion: multi-modal fusion of data with domain knowledge. More specifically, on data-driven correlational representation, we highlight three important categories of methods: multi-modal deep representation, multi-modal transfer learning, and multi-modal hashing. On knowledge-guided fusion, we discuss approaches for fusing knowledge with data and four exemplar applications that require various kinds of domain knowledge: multi-modal visual question answering, multi-modal video summarization, multi-modal visual pattern mining, and multi-modal recommendation. Finally, we present our insights and future research directions.
