MutualFormer: Multi-Modality Representation Learning via Mutual Transformer

Aggregating multi-modality data to obtain accurate and reliable representations has attracted increasing attention. Early approaches generally adopt CNNs to extract features from each modality independently and then aggregate them with a fusion module. However, the overall performance of such methods has become saturated due to the limited receptive field of local convolutional features. Recent studies demonstrate that Transformer models usually perform comparably to or even better than CNNs on multi-modality tasks, but they simply adopt concatenation or cross-attention for feature fusion, which may yield only sub-optimal results. In this work, we rethink the self-attention-based Transformer and propose a novel MutualFormer for multi-modality data fusion and representation. The core of MutualFormer is the design of both a token mixer and a modality mixer to enable communication among tokens and among modalities. Specifically, it contains three main modules: i) self-attention (SA) as the intra-modality token mixer, ii) cross-diffusion attention (CDA) as the inter-modality mixer, and iii) an aggregation module. The main advantage of the proposed CDA is that it is defined on individual domain similarities in the metric space, which naturally avoids the domain/modality-gap issue that arises when computing cross-modality similarities. We successfully apply MutualFormer to the saliency detection problem and propose a novel approach to obtain reinforced features from RGB and depth images. Extensive experiments on six popular datasets demonstrate that our model achieves results comparable to 16 SOTA models.
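The three-module design described above can be illustrated with a minimal sketch. This is not the paper's implementation: the exact forms of the diffusion step and the aggregation module are assumptions here, and the element-wise sum used for aggregation is a placeholder. The key idea shown is that each modality's affinity matrix is computed within its own metric space and then used to diffuse the other modality's tokens, so no cross-modality similarity is ever computed directly.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable row-wise softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Intra-modality token mixer: plain scaled dot-product self-attention
    (projections omitted for brevity)."""
    d = X.shape[-1]
    A = softmax(X @ X.T / np.sqrt(d))  # token-token affinities within one modality
    return A @ X

def cross_diffusion_attention(X_rgb, X_dep):
    """Inter-modality mixer (assumed form): each affinity matrix is built
    within its own modality, then diffuses the other modality's tokens."""
    d = X_rgb.shape[-1]
    A_rgb = softmax(X_rgb @ X_rgb.T / np.sqrt(d))  # intra-RGB similarities
    A_dep = softmax(X_dep @ X_dep.T / np.sqrt(d))  # intra-depth similarities
    Z_rgb = A_dep @ X_rgb  # depth-guided diffusion of RGB tokens
    Z_dep = A_rgb @ X_dep  # RGB-guided diffusion of depth tokens
    return Z_rgb, Z_dep

def mutualformer_block(X_rgb, X_dep):
    """One block: (i) SA per modality, (ii) CDA across modalities,
    (iii) aggregation -- a residual sum here as a placeholder."""
    X_rgb = self_attention(X_rgb)
    X_dep = self_attention(X_dep)
    Z_rgb, Z_dep = cross_diffusion_attention(X_rgb, X_dep)
    return X_rgb + Z_rgb, X_dep + Z_dep
```

Because each affinity matrix is row-stochastic over tokens of a single modality, the diffusion step is a convex mixing of the other modality's token features, which is what lets the block sidestep the modality gap in similarity computation.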
