SELD-TCN: Sound Event Localization & Detection via Temporal Convolutional Networks

Understanding the surrounding environment plays a critical role in autonomous robotic systems, such as self-driving cars. Extensive research has been carried out on visual perception. Yet, to obtain a more complete perception of the environment, autonomous systems of the future should also take acoustic information into account. Recent sound event localization and detection (SELD) frameworks use convolutional recurrent neural networks (CRNNs). However, the recurrent nature of CRNNs makes them challenging to implement efficiently on embedded hardware: their computations are difficult to parallelize, and they require high memory bandwidth and large memory buffers. In this work, we develop a novel, more robust, and hardware-friendly architecture based on a temporal convolutional network (TCN). The proposed framework (SELD-TCN) outperforms the state-of-the-art SELDnet on four different datasets. Moreover, SELD-TCN achieves 4x faster training time per epoch and 40x faster inference time on an ordinary graphics processing unit (GPU).
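To illustrate why a convolutional temporal model is easier to parallelize than a recurrent one, the following is a minimal sketch of a dilated 1-D convolution, the basic building block of a generic TCN. It is not the SELD-TCN architecture itself: the function names `dilated_conv1d` and `receptive_field`, the causal zero padding, the single-channel input, and the example dilation schedule are all assumptions made for illustration.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution of a single-channel sequence.

    Computes y[t] = sum_j w[j] * x[t - j * dilation], treating samples
    before the start of the sequence as zero. (Illustrative helper,
    not taken from the paper.)
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    # Each output step depends only on the input, never on an earlier
    # output, so all time steps are computed at once as shifted slices.
    return sum(
        w[j] * xp[pad - j * dilation : pad - j * dilation + len(x)]
        for j in range(k)
    )

def receptive_field(kernel_size, dilations):
    """Number of input steps seen by one output of a stack of dilated convs."""
    return 1 + (kernel_size - 1) * sum(dilations)

y = dilated_conv1d(np.arange(5.0), [1.0, 1.0], dilation=2)
rf = receptive_field(3, [1, 2, 4, 8])
print(y)   # [0. 1. 2. 4. 6.]
print(rf)  # 31
```

In contrast, a recurrent layer must compute the hidden state at step t before step t+1, which serializes the work along the time axis. Stacking layers with growing dilation lets the convolutional model cover a long temporal context (31 steps in the example above) with only a few layers.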
