Learning Unseen Modality Interaction

Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a feature projection module to project the multidimensional features of different modalities into a common space with rich information reserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to unreliable modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval.

[1]  Andrew Zisserman,et al.  Zorro: the masked multimodal transformer , 2023, ArXiv.

[2]  Yi Huang,et al.  Cross-Modal Federated Human Activity Recognition via Modality-Agnostic and Modality-Specific Representation Learning , 2022, AAAI.

[3]  Jiantao Zhou,et al.  Tag-assisted Multimodal Sentiment Analysis under Uncertain Missing Modalities , 2022, SIGIR.

[4]  Wen-bing Huang,et al.  Multimodal Token Fusion for Vision Transformers , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Xi Peng,et al.  Are Multimodal Transformers Robust to Missing Modality? , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  L. V. D. Maaten,et al.  Omnivore: A Single Model for Many Visual Modalities , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  James R. Glass,et al.  Everything at Once – Multi-modal Fusion Transformer for Video Retrieval , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jian Ma,et al.  Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 , 2021, Int. J. Comput. Vis..

[9]  C. Schmid,et al.  Attention Bottlenecks for Multimodal Fusion , 2021, NeurIPS.

[10]  Stephen Lin,et al.  Video Swin Transformer , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Yu Wu,et al.  Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Shih-Fu Chang,et al.  VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text , 2021, NeurIPS.

[13]  James R. Glass,et al.  AST: Audio Spectrogram Transformer , 2021, Interspeech.

[14]  Nuno Vasconcelos,et al.  Robust Audio-Visual Instance Discrimination , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Ling Shao,et al.  Repetitive Activity Counting by Sight and Sound , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Sergey Tulyakov,et al.  SMIL: Multimodal Learning with Severely Missing Modality , 2021, AAAI.

[17]  Andrew Zisserman,et al.  Perceiver: General Perception with Iterative Attention , 2021, ICML.

[18]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[19]  Chen Sun,et al.  Multi-modal Transformer for Video Retrieval , 2020, ECCV.

[20]  Chenliang Xu,et al.  Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing , 2020, ECCV.

[21]  K. Grauman,et al.  SoundSpaces: Audio-Visual Navigation in 3D Environments , 2019, ECCV.

[22]  Trevor Darrell,et al.  Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Dima Damen,et al.  EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Yang Liu,et al.  Use What You Have: Video retrieval using representations from collaborative experts , 2019, BMVC.

[25]  Silvio Savarese,et al.  Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks , 2019, IEEE Transactions on Robotics.

[26]  Tianjian Chen,et al.  Federated Machine Learning: Concept and Applications , 2019 .

[27]  Qiang Yang,et al.  Federated Machine Learning , 2019, ACM Trans. Intell. Syst. Technol..

[28]  Anit Kumar Sahu,et al.  Federated Optimization in Heterogeneous Networks , 2018, MLSys.

[29]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Yoshua Bengio,et al.  An Empirical Study of Example Forgetting during Deep Neural Network Learning , 2018, ICLR.

[31]  Ivan Laptev,et al.  Learning a Text-Video Embedding from Incomplete and Heterogeneous Data , 2018, ArXiv.

[32]  Tae-Hyun Oh,et al.  Learning to Localize Sound Source in Visual Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Sarvar Patel,et al.  Practical Secure Aggregation for Privacy-Preserving Machine Learning , 2017, IACR Cryptol. ePrint Arch..

[34]  Louis-Philippe Morency,et al.  Multimodal Machine Learning: A Survey and Taxonomy , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Aurélien Géron,et al.  Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems , 2017 .

[36]  Peter Richtárik,et al.  Federated Learning: Strategies for Improving Communication Efficiency , 2016, ArXiv.

[37]  Blaise Agüera y Arcas,et al.  Communication-Efficient Learning of Deep Networks from Decentralized Data , 2016, AISTATS.

[38]  Tao Mei,et al.  MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Daniel McDuff,et al.  Active Contrastive Learning of Audio-Visual Video Representations , 2021, ICLR.

[41]  Qin Jin,et al.  Missing Modality Imagination Network for Emotion Recognition with Uncertain Missing Modalities , 2021, ACL.