Sparse Fusion for Multimodal Transformers

Multimodal classification is a core task in human-centric machine learning. We observe that information is highly complementary across modalities; thus, unimodal information can be drastically sparsified prior to multimodal fusion without loss of accuracy. To this end, we present Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers that performs comparably to existing state-of-the-art methods while greatly reducing memory footprint and computation cost. Key to our idea is a sparse-pooling block that reduces unimodal token sets prior to cross-modality modeling. We evaluate SFT on multiple multimodal benchmark datasets covering a wide range of classification tasks, obtaining state-of-the-art performance on several benchmarks under similar experimental conditions, with up to a six-fold reduction in computational cost and memory requirements. Extensive ablation studies showcase the benefits of combining sparsification and multimodal learning over naive approaches. This paves the way for multimodal learning on low-resource devices.
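To make the idea concrete, below is a minimal PyTorch-style sketch of a sparse-pooling step that scores each unimodal token, keeps only the top-k most salient tokens per modality, and fuses the much smaller concatenated sequence with a standard transformer encoder. This is an illustration of the concept described in the abstract, not the paper's actual implementation: the module names, the learned linear scoring head, and the choice of k are all assumptions.

```python
import torch
import torch.nn as nn

class SparsePool(nn.Module):
    """Hypothetical sparse-pooling block: score tokens, keep the top-k.

    Illustrative sketch only; the scoring head and k are assumptions,
    not the authors' design.
    """
    def __init__(self, dim: int, k: int):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)  # learned per-token saliency score

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.score(tokens).squeeze(-1)           # (batch, num_tokens)
        topk = scores.topk(self.k, dim=1).indices         # (batch, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                      # (batch, k, dim)

# Sparsify each modality's token set before cross-modal fusion.
dim, k = 256, 8
pool_a, pool_v = SparsePool(dim, k), SparsePool(dim, k)
fusion = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)

audio = torch.randn(4, 512, dim)   # e.g. 512 audio tokens
video = torch.randn(4, 784, dim)   # e.g. 784 visual tokens
fused = fusion(torch.cat([pool_a(audio), pool_v(video)], dim=1))  # (4, 16, dim)
```

Because self-attention cost grows quadratically with sequence length, fusing 16 pooled tokens instead of the original 1,296 in this toy setting is where savings of the kind the abstract reports would come from.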
