End-to-End Environmental Sound Classification using a 1D Convolutional Neural Network

In this paper, we present an end-to-end approach for environmental sound classification based on a 1D Convolutional Neural Network (CNN) that learns a representation directly from the audio signal. Several convolutional layers capture the signal's fine time structure and learn diverse filters relevant to the classification task. The proposed approach can handle audio signals of any length because it splits the signal into overlapping frames using a sliding window. Different architectures with several input sizes are evaluated, including one in which the first convolutional layer is initialized with a Gammatone filterbank that models the human auditory filter response in the cochlea. The approach was assessed on the UrbanSound8k dataset, where it achieves a mean accuracy of 89%. It therefore outperforms most state-of-the-art approaches that use handcrafted features or 2D representations as input. Furthermore, the proposed approach has fewer parameters than other architectures found in the literature, which reduces the amount of data required for training.
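The two mechanisms named in the abstract, sliding-window framing and Gammatone initialization, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the frame length, the hop size, the zero-padding of the last frame, and the peak normalization are all illustrative assumptions, and the bandwidth uses the common Glasberg and Moore equivalent rectangular bandwidth (ERB) formula, which the abstract does not specify.

```python
import math


def split_into_frames(signal, frame_len, hop):
    """Split a 1D sequence into overlapping, equal-length frames.

    Assumption: the last frame is zero-padded when the signal runs out.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while True:
        frame = list(signal[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))  # pad the tail frame
        frames.append(frame)
        if start + frame_len >= len(signal):
            break
        start += hop
    return frames


def gammatone_kernel(center_hz, sample_rate, length, order=4):
    """Sample a gammatone impulse response to use as a 1D kernel.

    g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*f*t), with
    b = 1.019 * ERB(f) and ERB(f) = 24.7 * (4.37 * f / 1000 + 1).
    The result is scaled so that its largest magnitude is 1.
    """
    erb = 24.7 * (4.37 * center_hz / 1000.0 + 1.0)
    b = 1.019 * erb
    kernel = []
    for n in range(length):
        t = n / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        kernel.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    peak = max(abs(v) for v in kernel) or 1.0
    return [v / peak for v in kernel]


# Hypothetical usage: 50% overlap between consecutive frames.
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
kernel = gammatone_kernel(center_hz=1000.0, sample_rate=16000, length=64)
```

Each frame would be classified independently by the network, so a recording of any length yields a set of per-frame predictions that must then be combined. In a filterbank initialization, one such kernel per center frequency would serve as the starting weights of one filter in the first convolutional layer.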
