Environmental sound classification using a regularized deep convolutional neural network with data augmentation

Abstract The adoption of environmental sound classification (ESC) has grown rapidly in recent years owing to its broad range of applications in daily life. ESC, also known as sound event recognition (SER), involves recognizing environmental sounds from an audio stream. Several common factors, such as non-uniform distance between the acoustic source and the microphone, differences in recording conditions, the presence of multiple sound sources in a recording, and overlapping sound events, make the ESC problem complex. This study employs deep convolutional neural networks (DCNNs) with regularization and data augmentation, using basic audio features that have proven effective on ESC tasks. The performance of a DCNN with a max-pooling function (Model-1) and without one (Model-2) is examined. Three audio feature extraction techniques, Mel spectrogram (Mel), Mel-frequency cepstral coefficients (MFCC), and Log-Mel, are evaluated on the ESC-10, ESC-50, and UrbanSound8K (US8K) datasets. Furthermore, to reduce the risk of overfitting on these limited datasets, offline data augmentation techniques are introduced to enlarge them, in combination with L2 regularization. The performance evaluation shows that the best accuracy is attained by the proposed DCNN without max-pooling (Model-2) using Log-Mel features on the augmented datasets: 94.94% on ESC-10, 89.28% on ESC-50, and 95.37% on US8K. These experimental results indicate that the proposed approach achieves strong performance on environmental sound classification problems.
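The Log-Mel feature named above, which yielded the best results, can be illustrated with a minimal NumPy sketch: frame the waveform, take the power spectrum, apply a triangular mel filterbank, and log-compress. The parameter values (`n_fft`, `hop`, `n_mels`) below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=22050, n_fft=1024, hop=512, n_mels=60):
    """Compute a Log-Mel spectrogram: (num_frames, n_mels) array in dB."""
    # Frame the signal and apply a Hann window
    window = np.hanning(n_fft)
    frames = np.array([signal[s:s + n_fft] * window
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame (n_fft//2 + 1 frequency bins)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Build a triangular mel filterbank on mel-spaced center frequencies
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):
            fbank[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fbank[i - 1, j] = (right - j) / max(right - center, 1)
    mel_power = power @ fbank.T
    # Log compression turns mel power into the Log-Mel representation
    return 10.0 * np.log10(np.maximum(mel_power, 1e-10))
```

In a full pipeline, this 2-D time-frequency array would be fed to the DCNN as an image-like input; library implementations (e.g. `librosa.feature.melspectrogram` followed by `librosa.power_to_db`) compute the same representation.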
