Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in video frames using audio cues. However, current fusion-based methods are limited in performance by the small receptive field of convolution and by inadequate fusion of audio-visual features. To overcome these issues, we propose a novel Audio-aware query-enhanced TRansformer (AuTR). Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features. Furthermore, we devise an audio-aware query-enhanced transformer decoder that explicitly helps the model focus on segmenting the sounding objects pinpointed by the audio signal, while disregarding silent yet salient objects. Experimental results show that our method outperforms previous methods and generalizes better in multi-sound and open-set scenarios.
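The idea of audio-aware queries can be illustrated with a minimal NumPy sketch. It is not the AuTR implementation: the abstract does not specify how the audio signal conditions the decoder queries, so the additive conditioning, the single-head attention, the mean-pooled mask head, and all names (`audio_aware_queries`, `cross_attention`, `predict_mask`) and tensor sizes below are assumptions made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def audio_aware_queries(learned_queries, audio_feat):
    # ASSUMPTION: condition each learned query on the audio embedding by
    # simple addition. learned_queries: (Q, D), audio_feat: (D,).
    return learned_queries + audio_feat[None, :]

def cross_attention(queries, visual_tokens):
    # Single-head scaled dot-product attention from queries (Q, D)
    # to flattened visual tokens (N, D).
    d = queries.shape[-1]
    scores = queries @ visual_tokens.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)          # (Q, N), rows sum to 1
    return weights @ visual_tokens, weights     # (Q, D), (Q, N)

def predict_mask(query_out, visual_tokens, h, w):
    # ASSUMPTION: hypothetical mask head that scores each visual token
    # against the mean decoded query and applies a sigmoid.
    logits = visual_tokens @ query_out.mean(axis=0)   # (N,)
    return (1.0 / (1.0 + np.exp(-logits))).reshape(h, w)

# Toy inputs standing in for encoder outputs (random, for shape checking only).
rng = np.random.default_rng(0)
H, W, D, Q = 4, 4, 8, 3
visual_tokens = rng.normal(size=(H * W, D))   # flattened frame features
audio_feat = rng.normal(size=D)               # one audio embedding per frame
learned_queries = rng.normal(size=(Q, D))     # decoder object queries

queries = audio_aware_queries(learned_queries, audio_feat)
decoded, weights = cross_attention(queries, visual_tokens)
mask = predict_mask(decoded, visual_tokens, H, W)
```

Here `weights` has shape `(Q, H*W)` and `mask` has shape `(H, W)` with values between 0 and 1. Because the queries carry the audio embedding, the attention they place on visual tokens depends on the sound, which is the intuition behind steering segmentation toward sounding objects rather than merely salient ones.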
