Self-Supervised Learning for Environmental Sound Classification

Abstract Environmental Sound Classification (ESC) is one of the most challenging tasks in signal processing, digital forensics, and machine learning, and numerous methods have been proposed to address it. Conventional model training depends on an enormous amount of annotated data, especially when training deep models. This paper presents a self-supervised learning (SSL)-based deep classifier for ESC, a method still under-explored in this field. The SSL mechanism directs the model to learn prototypical features from the data itself by solving a pretext task. The proposed model takes spectrogram images as input, and its pretext (auxiliary) task is defined as identifying the type of data augmentation applied to the signal. The model learned by solving the pretext task is then fine-tuned for ESC. The model's performance is evaluated on two benchmark sound classification datasets, ESC-10 and DCASE 2019 Task-1(A). The experiments show that the SSL model attains an improvement of 12.59% and 11.17% in accuracy over the baseline models of the DCASE 2019 Task-1(A) and ESC-10 datasets, respectively, and performs competitively with state-of-the-art methods.
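The pretext task described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the specific augmentations (time shift, additive noise, frequency masking) and all function names are assumptions chosen for clarity; the paper only states that the pretext label is the type of augmentation applied to the spectrogram.

```python
import numpy as np

# Hypothetical augmentations; the paper's actual set may differ.
def time_shift(spec, frac=0.25):
    """Circularly shift the spectrogram along the time axis."""
    return np.roll(spec, int(spec.shape[1] * frac), axis=1)

def add_noise(spec, std=0.05, seed=0):
    """Add Gaussian noise to the spectrogram."""
    rng = np.random.default_rng(seed)
    return spec + rng.normal(0.0, std, spec.shape)

def freq_mask(spec, width=8):
    """Zero out the lowest `width` frequency bins."""
    out = spec.copy()
    out[:width, :] = 0.0
    return out

AUGMENTATIONS = [time_shift, add_noise, freq_mask]

def make_pretext_batch(specs, seed=42):
    """Apply a random augmentation to each spectrogram and label it
    with the augmentation's index. The resulting (input, label) pairs
    train the pretext classifier without any human annotation; its
    learned features are later fine-tuned on the ESC labels."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for spec in specs:
        label = int(rng.integers(len(AUGMENTATIONS)))
        xs.append(AUGMENTATIONS[label](spec))
        ys.append(label)
    return np.stack(xs), np.array(ys)
```

A batch produced this way would feed a standard CNN classifier over spectrogram images, with the augmentation index as its target class during pre-training.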
