Contrastive Positive Sample Propagation Along the Audio-Visual Event Line

Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize the video segments containing an AVE and identify its category. Learning discriminative features for each video segment is pivotal to this task. Unlike existing work that focuses on audio-visual feature fusion, in this paper we propose a new contrastive positive sample propagation (CPSP) method for better deep feature representation learning. The contribution of CPSP is to introduce the available full or weak labels as a prior for constructing exact positive-negative samples for contrastive learning. Specifically, CPSP involves comprehensive contrastive constraints: pair-level positive sample propagation (PSP), and segment-level and video-level positive sample activation (PSA$_{S}$ and PSA$_{V}$). Three new contrastive objectives (i.e., $\mathcal{L}_{\text{avpsp}}$, $\mathcal{L}_{\text{spsa}}$, and $\mathcal{L}_{\text{vpsa}}$) are proposed and introduced into both fully and weakly supervised AVE localization. To draw a complete picture of contrastive learning in AVE localization, we also study self-supervised positive sample propagation (SSPSP). As a result, CPSP helps obtain refined audio-visual features that are distinguishable from the negatives, thus benefiting the classifier prediction. Extensive experiments on the AVE and the newly collected VGGSound-AVEL100k datasets verify the effectiveness and generalization ability of our method.
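
The idea of using labels as a prior to decide which samples count as positives and which as negatives can be illustrated with a generic supervised-contrastive-style loss. The sketch below is not the paper's $\mathcal{L}_{\text{avpsp}}$, $\mathcal{L}_{\text{spsa}}$, or $\mathcal{L}_{\text{vpsa}}$, whose exact forms are not given in this abstract. The function name `label_guided_contrastive_loss`, the temperature value, and the toy features are all assumptions made for illustration.

```python
import numpy as np


def label_guided_contrastive_loss(features, labels, temperature=0.1):
    """Generic label-guided contrastive loss (illustrative, not the paper's objective).

    features: (N, D) array of segment features (hypothetical fused audio-visual features).
    labels:   (N,) integer array; segments sharing a label are treated as positives.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)

    # L2-normalize so that dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    # The label prior defines the positive set; every other sample is a negative.
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax over all other samples (the anchor itself is excluded).
    sim = np.where(not_self, sim, -np.inf)
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))

    # Average the log-probability of the positives for each anchor,
    # skipping anchors that have no positive in the batch.
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


# Toy usage: two tight clusters of features.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
aligned = label_guided_contrastive_loss(feats, np.array([0, 0, 1, 1]))
misaligned = label_guided_contrastive_loss(feats, np.array([0, 1, 0, 1]))
print(aligned < misaligned)
```

In this toy setup the loss is lower when the labels agree with the feature clusters than when they cut across them. That is the behaviour a label-derived positive set is meant to encourage: features of the same event category are pulled together and pushed away from the negatives.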
