Dynamic Multimodal Fusion

Deep multimodal learning has achieved great progress in recent years. However, current fusion approaches are static in nature: they process and fuse multimodal inputs with identical computation, without accounting for the diverse computational demands of different multimodal data. In this work, we propose dynamic multimodal fusion (DynMM), a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference. To this end, we propose a gating function that provides modality-level or fusion-level decisions on the fly based on multimodal features, together with a resource-aware loss function that encourages computational efficiency. Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach. For instance, compared with static fusion approaches, DynMM reduces computation costs by 46.5% with only a negligible accuracy loss (CMU-MOSEI sentiment analysis) and improves segmentation performance with over 21% savings in computation (NYU Depth V2 semantic segmentation). We believe our approach opens a new direction in dynamic multimodal network design, with applications to a wide range of multimodal tasks.
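
As a rough illustration of the idea, the PyTorch sketch below shows one plausible way a per-sample gate could choose between a cheap unimodal branch and a costlier fusion branch, using a Gumbel-softmax relaxation for differentiable discrete decisions and a resource-aware penalty on the gate's expected compute. This is a minimal sketch under our own assumptions: the two-branch setup and all names (`DynamicFusion`, `branch_costs`, `lambda_resource`) are illustrative, not the paper's actual architecture.

```python
# Minimal sketch of modality-level dynamic fusion with a resource-aware loss.
# Assumes two hypothetical branches: a cheap unimodal head and a costlier
# bimodal fusion head. Not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFusion(nn.Module):
    def __init__(self, dim_a, dim_b, hidden, num_classes, lambda_resource=0.1):
        super().__init__()
        # Branch 0: cheap unimodal head; branch 1: costly bimodal fusion head.
        self.branch_uni = nn.Linear(dim_a, num_classes)
        self.branch_fused = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))
        # Gating network: per-sample logits over the two branches.
        self.gate = nn.Linear(dim_a + dim_b, 2)
        # Hypothetical relative compute costs of the two branches (e.g., MACs).
        self.register_buffer("branch_costs", torch.tensor([1.0, 4.0]))
        self.lambda_resource = lambda_resource

    def forward(self, x_a, x_b, tau=1.0):
        logits = self.gate(torch.cat([x_a, x_b], dim=-1))
        # Gumbel-softmax with hard=True yields a (near-)discrete branch choice
        # while keeping gradients via the straight-through estimator.
        g = F.gumbel_softmax(logits, tau=tau, hard=True)  # shape (B, 2)
        # For simplicity, both branches run here; a deployed model would
        # execute only the selected branch at inference time.
        out = (g[:, 0:1] * self.branch_uni(x_a)
               + g[:, 1:2] * self.branch_fused(torch.cat([x_a, x_b], dim=-1)))
        # Resource-aware penalty: expected compute under the gate's soft probs.
        probs = F.softmax(logits, dim=-1)
        resource_loss = self.lambda_resource * (probs * self.branch_costs).sum(-1).mean()
        return out, resource_loss
```

During training, both branches execute so gradients flow through the straight-through Gumbel-softmax estimator; at inference, the hard gate would let an implementation skip the unselected branch entirely, which is where the compute savings reported in the abstract would come from. The `lambda_resource` weight trades accuracy against efficiency.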
