Paper Information: Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong, generalized representations that are useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as on sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound.
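The three kinds of relations named in the abstract can be illustrated with a small sketch. It assumes two temporal segments per video, one audio view and one visual view per segment, and a negative-cosine-similarity objective with equal weights for every pair. The abstract specifies none of these details, so the function names (`pair_relations`, `relaxed_sync_loss`), the loss, and the weighting are hypothetical choices for illustration, not the authors' implementation.

```python
import math
from itertools import product


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pair_relations(num_segments=2):
    """Group every pair of (modality, segment) views by relation type.

    Assumption: same modality -> intra-modal; different modality and same
    segment -> synchronous; different modality and different segment ->
    asynchronous.
    """
    views = list(product(("audio", "visual"), range(num_segments)))
    relations = {
        "intra_modal": [],
        "sync_cross_modal": [],
        "async_cross_modal": [],
    }
    for i, (mod_a, seg_a) in enumerate(views):
        for mod_b, seg_b in views[i + 1:]:
            if mod_a == mod_b:
                key = "intra_modal"
            elif seg_a == seg_b:
                key = "sync_cross_modal"
            else:
                key = "async_cross_modal"
            relations[key].append(((mod_a, seg_a), (mod_b, seg_b)))
    return relations


def relaxed_sync_loss(embeddings, relations):
    """Mean negative cosine similarity over all pairs (assumed objective)."""
    pairs = [pair for group in relations.values() for pair in group]
    total = sum(
        cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs
    )
    return -total / len(pairs)


if __name__ == "__main__":
    relations = pair_relations()
    # Toy embeddings keyed by (modality, segment); values are placeholders.
    embeddings = {
        ("audio", 0): [1.0, 0.0],
        ("audio", 1): [0.9, 0.1],
        ("visual", 0): [0.8, 0.2],
        ("visual", 1): [0.7, 0.3],
    }
    for name, pairs in relations.items():
        print(name, pairs)
    print("loss:", relaxed_sync_loss(embeddings, relations))
```

Dropping the `async_cross_modal` group from `relations` before computing the loss gives the strictly synchronous setting, which makes the effect of relaxing synchronicity easy to compare in this toy setup.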

A. Etemad | Pritam Sarkar
