Cross-modal Learning for Multi-modal Video Categorization

Multi-modal machine learning (ML) models process data in multiple modalities (e.g., video, audio, text) and are useful for a variety of video content analysis problems (e.g., object detection, scene understanding, activity recognition). In this paper, we focus on video categorization using a multi-modal ML technique. In particular, we develop a novel multi-modal ML approach that we call "cross-modal learning", in which one modality influences another only when the modalities are correlated; to enable this, we first train a correlation tower that guides the main multi-modal video categorization tower of the model. We show how this cross-modal principle can be applied to different model types (e.g., RNN, Transformer, NetVLAD), and demonstrate through experiments that our multi-modal video categorization models with cross-modal learning outperform strong state-of-the-art baselines.
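
The sketch below illustrates the idea of a correlation tower gating how much one modality influences another before categorization. It is a minimal, hypothetical PyTorch example assuming pooled per-video embeddings for two modalities; the module names, dimensions, and gating scheme are illustrative assumptions, not the paper's exact architecture.

    # Minimal sketch (assumptions, not the paper's architecture): a correlation
    # tower scores how correlated the video and audio embeddings are, and that
    # score gates the cross-modal influence inside the main categorization tower.
    import torch
    import torch.nn as nn

    class CorrelationTower(nn.Module):
        """Scores how correlated the video and audio embeddings are (0..1)."""
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
            )

        def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
            return self.score(torch.cat([video, audio], dim=-1))  # (batch, 1)

    class CrossModalCategorizer(nn.Module):
        """Main tower: audio influences the video representation only to the
        degree the correlation tower says the two modalities agree."""
        def __init__(self, dim: int, num_classes: int):
            super().__init__()
            self.correlation = CorrelationTower(dim)
            self.audio_to_video = nn.Linear(dim, dim)
            self.classifier = nn.Linear(2 * dim, num_classes)

        def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
            gate = self.correlation(video, audio)               # learned correlation score
            video = video + gate * self.audio_to_video(audio)   # gated cross-modal influence
            return self.classifier(torch.cat([video, audio], dim=-1))

    # Example usage with random per-video embeddings (batch of 4, 128-dim).
    model = CrossModalCategorizer(dim=128, num_classes=10)
    logits = model(torch.randn(4, 128), torch.randn(4, 128))
    print(logits.shape)  # torch.Size([4, 10])

In the paper's approach the correlation tower is trained first and then guides the main tower; the sketch above only shows how a learned correlation score could gate cross-modal influence within a single forward pass.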
