Cross-modal Learning for Multi-modal Video Categorization

Multi-modal machine learning (ML) models process data in multiple modalities (e.g., video, audio, text) and are useful for a variety of video content analysis problems (e.g., object detection, scene understanding, activity recognition). In this paper, we focus on video categorization using a multi-modal ML technique. In particular, we develop a novel multi-modal ML approach that we call "cross-modal learning", in which one modality influences another only when the modalities are correlated; to enable this, we first train a correlation tower that guides the main multi-modal video categorization tower of the model. We show how this cross-modal principle can be applied to different model types (e.g., RNN, Transformer, NetVLAD), and demonstrate through experiments that our multi-modal video categorization models with cross-modal learning outperform strong state-of-the-art baselines.
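
The sketch below illustrates the idea of a correlation tower gating how much one modality influences another before categorization. It is a minimal, hypothetical PyTorch example assuming pooled per-video embeddings for two modalities; the module names, dimensions, and gating scheme are illustrative assumptions, not the paper's exact architecture.

    # Minimal sketch (assumptions, not the paper's architecture): a correlation
    # tower scores how correlated the video and audio embeddings are, and that
    # score gates the cross-modal influence inside the main categorization tower.
    import torch
    import torch.nn as nn

    class CorrelationTower(nn.Module):
        """Scores how correlated the video and audio embeddings are (0..1)."""
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
            )

        def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
            return self.score(torch.cat([video, audio], dim=-1))  # (batch, 1)

    class CrossModalCategorizer(nn.Module):
        """Main tower: audio influences the video representation only to the
        degree the correlation tower says the two modalities agree."""
        def __init__(self, dim: int, num_classes: int):
            super().__init__()
            self.correlation = CorrelationTower(dim)
            self.audio_to_video = nn.Linear(dim, dim)
            self.classifier = nn.Linear(2 * dim, num_classes)

        def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
            gate = self.correlation(video, audio)               # learned correlation score
            video = video + gate * self.audio_to_video(audio)   # gated cross-modal influence
            return self.classifier(torch.cat([video, audio], dim=-1))

    # Example usage with random per-video embeddings (batch of 4, 128-dim).
    model = CrossModalCategorizer(dim=128, num_classes=10)
    logits = model(torch.randn(4, 128), torch.randn(4, 128))
    print(logits.shape)  # torch.Size([4, 10])

In the paper's approach the correlation tower is trained first and then guides the main tower; the sketch above only shows how a learned correlation score could gate cross-modal influence within a single forward pass.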
