Improving Audio-Visual Video Parsing with Pseudo Visual Labels

Audio-Visual Video Parsing is the task of predicting the events that occur in each video segment for each modality. It is often performed in a weakly supervised manner, where only video-level event labels are provided, i.e., the modalities and the timestamps of the events are unknown. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. A commonly used strategy is to generate pseudo labels by categorizing the known video-level event labels for each modality. However, these labels are still limited to the video level, and the temporal boundaries of the events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that explicitly assigns labels to each video segment by utilizing prior knowledge learned from the open world. Specifically, we exploit the CLIP model to estimate the events in each video segment from the visual modality and thereby generate segment-level pseudo labels. A new loss function is proposed to regularize these labels by taking into account their category-richness and segment-richness. A label denoising strategy is adopted to improve the pseudo labels by flipping them whenever a high forward binary cross-entropy loss occurs. We perform extensive experiments on the LLP dataset and demonstrate that our method can generate high-quality segment-level pseudo labels with the help of the proposed loss and the label denoising strategy. Our method achieves state-of-the-art audio-visual video parsing performance.
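The two label-handling ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not give the exact rules, so the function names, the similarity threshold of 0.5, and the loss threshold of 2.0 are all hypothetical. The sketch assumes that per-segment, per-class similarity scores in [0, 1] (for instance from a CLIP-style model) are already available, keeps only classes present in the video-level label, and then flips any pseudo label whose element-wise binary cross-entropy against the model's prediction is large.

```python
import numpy as np


def segment_pseudo_labels(similarity, video_labels, threshold=0.5):
    """Threshold (T, C) similarity scores and mask them with the (C,) video-level labels.

    `threshold` is a hypothetical value, not taken from the paper.
    """
    similarity = np.asarray(similarity, dtype=float)
    video_labels = np.asarray(video_labels, dtype=int)
    above = (similarity > threshold).astype(int)
    # A class absent from the video-level label cannot be active in any segment.
    return above * video_labels[None, :]


def elementwise_bce(probs, labels, eps=1e-7):
    """Binary cross-entropy for every (segment, class) entry, without reduction."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    return -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))


def flip_high_loss_labels(probs, labels, loss_threshold=2.0):
    """Flip pseudo labels whose BCE loss exceeds `loss_threshold` (assumed value)."""
    labels = np.asarray(labels, dtype=int)
    loss = elementwise_bce(probs, labels)
    return np.where(loss > loss_threshold, 1 - labels, labels)


# Toy example: 3 segments, 2 event classes; only class 0 is in the video label.
similarity = [[0.9, 0.8], [0.2, 0.7], [0.6, 0.1]]
video_labels = [1, 0]
pseudo = segment_pseudo_labels(similarity, video_labels)

# Hypothetical model predictions: the model is confident class 0 occurs in segment 1.
probs = [[0.95, 0.05], [0.97, 0.05], [0.60, 0.05]]
denoised = flip_high_loss_labels(probs, pseudo)
```

In the toy example the second segment starts with a pseudo label of 0 for class 0, but the model predicts 0.97, which gives a loss of about 3.5 and triggers a flip to 1. This simple version does not re-apply the video-level mask after flipping; whether the paper does so is not stated in the abstract.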
