Improving Self-Supervised Learning for Audio Representations by Feature Diversity and Decorrelation

Self-supervised learning (SSL) has recently shown remarkable progress in closing the gap between supervised and unsupervised learning. The core idea is to learn robust features that are invariant to distortions of the input data. Despite its success, this approach can suffer from representation collapse, where the network produces a constant output regardless of the input. To address this, we introduce SELFIE, a novel Self-supervised Learning approach for audio representations via Feature Diversity and Decorrelation. SELFIE avoids collapse by ensuring that the representation (i) maintains high diversity among embeddings and (ii) decorrelates the dependencies between its dimensions. SELFIE is pre-trained on the large-scale AudioSet dataset, and its embeddings are evaluated on nine audio downstream tasks spanning speech, music, and sound event recognition. Experimental results show that SELFIE outperforms existing SSL methods on several tasks.
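
The abstract does not spell out the training objective, but the two ingredients it names, embedding diversity and dimension-wise decorrelation, can be illustrated with a VICReg-style regularizer. The sketch below is an assumption for illustration only: the function name, the hinge target of 1.0, and the loss weights are hypothetical and not taken from the paper.

import torch
import torch.nn.functional as F

def diversity_decorrelation_loss(z1, z2, w_div=25.0, w_dec=1.0, eps=1e-4):
    # z1, z2: (batch, dim) embeddings of two augmented views of the
    # same audio clips. Hypothetical sketch, not the paper's exact loss.
    n, d = z1.shape

    # Invariance: embeddings of the two views should agree.
    inv = F.mse_loss(z1, z2)

    # Diversity: keep each dimension's standard deviation above a
    # target so embeddings cannot collapse to a constant vector.
    std1 = torch.sqrt(z1.var(dim=0) + eps)
    std2 = torch.sqrt(z2.var(dim=0) + eps)
    div = F.relu(1.0 - std1).mean() + F.relu(1.0 - std2).mean()

    # Decorrelation: penalize off-diagonal covariance entries so that
    # dimensions carry non-redundant information.
    z1c, z2c = z1 - z1.mean(dim=0), z2 - z2.mean(dim=0)
    cov1 = (z1c.T @ z1c) / (n - 1)
    cov2 = (z2c.T @ z2c) / (n - 1)
    off1 = (cov1 - torch.diag(torch.diag(cov1))).pow(2).sum() / d
    off2 = (cov2 - torch.diag(torch.diag(cov2))).pow(2).sum() / d

    return inv + w_div * div + w_dec * (off1 + off2)

In this reading, the diversity term directly rules out the constant-representation failure mode, while the decorrelation term targets the subtler dimensional collapse in which embeddings occupy only a low-dimensional subspace.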
