Sound and Visual Representation Learning with Multiple Pretraining Tasks

Different self-supervised tasks (SSL) reveal different features from the data. The learned feature representations can exhibit different performance for each downstream task. In this light, this work aims to combine Multiple SSL tasks (Multi-SSL) that generalizes well for all downstream tasks. Specifically, for this study, we investigate binaural sounds and image data in isolation. For binaural sounds, we propose three SSL tasks namely, spatial alignment, temporal synchronization of foreground objects and binaural sounds and temporal gap prediction. We investigate several approaches of Multi-SSL and give insights into the downstream task performance on video retrieval, spatial sound super resolution, and semantic prediction using OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models and fully supervised models in the downstream task performance. As a check of applicability on other modalities, we also formulate our Multi-SSL models for image representation learning and we use the recently proposed SSL tasks, MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL and DetCo by 2.06%, 3.27% and 1.19% on VOC07 classification and +2.83, +1.56 and +1.61 AP on COCO detection. Code will be made publicly available.

[1]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[3]  Yi Li,et al.  Learning Representations from Audio-Visual Spatial Alignment , 2020, NeurIPS.

[4]  Ender Konukoglu,et al.  Contrastive learning of global and local features for medical image segmentation with limited annotations , 2020, NeurIPS.

[5]  R Devon Hjelm,et al.  Learning Representations by Maximizing Mutual Information Across Views , 2019, NeurIPS.

[6]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[8]  Qi Liu,et al.  Multi-Task Self-Supervised Learning for Disfluency Detection , 2019, AAAI.

[9]  Nuno Vasconcelos,et al.  Audio-Visual Instance Discrimination with Cross-Modal Agreement , 2020, ArXiv.

[10]  Stella X. Yu,et al.  Unsupervised Feature Learning via Non-parametric Instance Discrimination , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Paolo Favaro,et al.  Representation Learning by Learning to Count , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Marco Tagliasacchi,et al.  Self-supervised audio representation learning for mobile devices , 2019, ArXiv.

[15]  Razvan Pascanu,et al.  Progressive Neural Networks , 2016, ArXiv.

[16]  Chuang Gan,et al.  The Sound of Pixels , 2018, ECCV.

[17]  Abhinav Gupta,et al.  Scaling and Benchmarking Self-Supervised Visual Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Aren Jansen,et al.  Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Andrew Owens,et al.  Self-Supervised Learning of Audio-Visual Objects from Video , 2020, ECCV.

[20]  Weiping Wang,et al.  Dense Semantic Contrast for Self-Supervised Visual Representation Learning , 2021, ACM Multimedia.

[21]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[22]  Andrew Owens,et al.  Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Kaiming He,et al.  Improved Baselines with Momentum Contrastive Learning , 2020, ArXiv.

[24]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[25]  Nathanael Perraudin,et al.  A Context Encoder For Audio Inpainting , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[26]  Quoc V. Le,et al.  Multi-Task Self-Training for Learning General Representations , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Stefano Ermon,et al.  Audio Super Resolution using Neural Networks , 2017, ICLR.

[29]  Andrew Zisserman,et al.  Objects that Sound , 2017, ECCV.

[30]  Ali Razavi,et al.  Data-Efficient Image Recognition with Contrastive Predictive Coding , 2019, ICML.

[31]  Xiaogang Wang,et al.  Vision-Infused Deep Audio Inpainting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Luc Van Gool,et al.  Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds , 2020, ECCV.

[33]  Tao Kong,et al.  Dense Contrastive Learning for Self-Supervised Visual Pre-Training , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Yingli Tian,et al.  Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Xinlei Chen,et al.  Exploring Simple Siamese Representation Learning , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Alec Radford,et al.  Improving Language Understanding by Generative Pre-Training , 2018 .

[37]  Jiebo Luo,et al.  AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations Rather Than Data , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Chen Fang,et al.  Visual to Sound: Generating Natural Sound for Videos in the Wild , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Nikos Komodakis,et al.  Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[40]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[41]  Geoffrey E. Hinton,et al.  Big Self-Supervised Models are Strong Semi-Supervised Learners , 2020, NeurIPS.

[42]  Luc Van Gool,et al.  Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[44]  Laurens van der Maaten,et al.  Self-Supervised Learning of Pretext-Invariant Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Chao Wang,et al.  Multi-Task Self-Supervised Pre-Training for Music Classification , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[46]  Andrew Zisserman,et al.  Self-supervised Co-training for Video Representation Learning , 2020, NeurIPS.

[47]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[48]  Michal Valko,et al.  Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , 2020, NeurIPS.

[49]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[50]  Yoshua Bengio,et al.  Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks , 2019, INTERSPEECH.

[51]  Aren Jansen,et al.  CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[52]  Andrew Zisserman,et al.  Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[53]  Derek Hoiem,et al.  Learning without Forgetting , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Ildoo Kim,et al.  Spatially Consistent Representation Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Titouan Parcollet,et al.  Pretext Tasks selection for multitask self-supervised speech representation learning , 2021, ArXiv.

[56]  In-So Kweon,et al.  Learning Image Representations by Completing Damaged Jigsaw Puzzles , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[57]  Yoshua Bengio,et al.  Learning deep representations by mutual information estimation and maximization , 2018, ICLR.

[58]  Oriol Vinyals,et al.  Efficient Visual Pretraining with Contrastive Detection , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[59]  Shmuel Peleg,et al.  Vid2speech: Speech reconstruction from silent video , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[60]  Luc Van Gool,et al.  Multi-Task Learning for Dense Prediction Tasks: A Survey. , 2020, IEEE transactions on pattern analysis and machine intelligence.

[61]  Andrew Zisserman,et al.  Learning and Using the Arrow of Time , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[62]  Andrew Owens,et al.  Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.

[63]  Hang Zhou,et al.  Talking Face Generation by Adversarially Disentangled Audio-Visual Representation , 2018, AAAI.

[64]  Karim Helwani,et al.  Self-Supervised Classification for Detecting Anomalous Sounds , 2020, DCASE.

[65]  Lorenzo Torresani,et al.  Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.

[66]  Antonio Torralba,et al.  SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.

[67]  Fabio Maria Carlucci,et al.  Domain Generalization by Solving Jigsaw Puzzles , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Yoshua Bengio,et al.  Multi-Task Self-Supervised Learning for Robust Speech Recognition , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[70]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[71]  Wonhee Lee,et al.  Multi-Task Self-Supervised Object Detection via Recycling of Bounding Box Annotations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Rogério Schmidt Feris,et al.  Learning to Separate Object Sounds by Watching Unlabeled Video , 2018, ECCV.

[73]  Kevin Gimpel,et al.  ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , 2019, ICLR.

[74]  Gregory Shakhnarovich,et al.  Colorization as a Proxy Task for Visual Understanding , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Zhenguo Li,et al.  DetCo: Unsupervised Contrastive Learning for Object Detection , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).