End-to-end Convolutional Neural Networks for Sound Event Detection in Urban Environments

We present a novel approach to sound event detection (SED) in urban environments using end-to-end convolutional neural networks (CNNs). It consists of a 1D CNN that extracts the energy in mel-frequency bands from the audio signal using a simple filter bank, followed by a 2D CNN that performs the classification task. The main goal of this two-stage architecture is to make the first layers of the network more interpretable and to allow them to be reused in other problems of the same domain. We present a novel neural-network model for computing the mel-spectrogram that outperforms existing work in both simplicity and matching performance. We also implement a recently proposed approach to normalizing the energy of the mel-spectrogram (per-channel energy normalization, PCEN) as a layer of the neural network. We show how the parameters of this normalization can be learned by the network and why this is useful for SED in urban environments. We study how training modifies the filter bank as well as the PCEN parameters. The resulting system achieves classification results comparable to the state of the art while reducing the number of parameters involved.
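As an illustration of the normalization step mentioned above, the following is a minimal NumPy sketch of PCEN in its commonly published form: a first-order smoother M(t, f) = (1 - s) M(t - 1, f) + s E(t, f), followed by PCEN(t, f) = (E(t, f) / (eps + M(t, f))^alpha + delta)^r - delta^r. The function name `pcen` and the default parameter values are assumptions made for illustration, not the values used or learned in this work, where these parameters are trained as part of the network.

```python
import numpy as np


def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a mel-band energy matrix.

    energy: array of shape (n_frames, n_bands) with non-negative values.
    The default parameters are illustrative assumptions; in a trainable
    layer they would be learned instead of fixed.
    """
    energy = np.asarray(energy, dtype=float)

    # First-order IIR smoother applied along time, independently per band.
    smoothed = np.empty_like(energy)
    smoothed[0] = energy[0]
    for t in range(1, energy.shape[0]):
        smoothed[t] = (1.0 - s) * smoothed[t - 1] + s * energy[t]

    # Adaptive gain control followed by dynamic range compression.
    gain = (eps + smoothed) ** (-alpha)
    return (energy * gain + delta) ** r - delta ** r


# Usage: 100 frames of random energies in 64 mel bands.
rng = np.random.default_rng(0)
mel_energy = rng.random((100, 64))
normalized = pcen(mel_energy)
```

With alpha close to 1, the gain control divides out slowly varying loudness in each band, so a stationary background at different levels maps to nearly the same output. This is one reason the normalization is attractive for urban recordings with varying background noise.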
