PILOT: Introducing Transformers for Probabilistic Sound Event Localization

Sound event localization aims to estimate the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain have focused most prominently on deep recurrent neural networks. Inspired by the success of transformer architectures as an alternative to classical recurrent neural networks, this paper introduces a novel transformer-based sound event localization framework in which temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms. Additionally, the estimated sound event positions are represented as multivariate Gaussian variables, which yields a notion of uncertainty that many previously proposed deep learning-based systems for this application do not provide. The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy. It outperforms all competing systems on all datasets, with statistically significant differences in performance.
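To illustrate what representing a position estimate as a multivariate Gaussian variable provides, the following minimal sketch scores a two-dimensional estimate (for example azimuth and elevation, in radians) by its negative log-likelihood. It is an illustration under stated assumptions, not the training objective of the paper: the function name `gaussian_nll_2d`, the restriction to two dimensions, and the neglect of angular wrap-around are all assumptions made here.

```python
import math


def gaussian_nll_2d(target, mean, cov):
    """Negative log-likelihood of a 2-D target under N(mean, cov).

    Hypothetical helper for illustration only; `cov` is a 2x2
    covariance given as ((a, b), (c, d)). Angular wrap-around is ignored.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0.0 or det <= 0.0:
        raise ValueError("covariance must be positive definite")
    dx = target[0] - mean[0]
    dy = target[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.log(2.0 * math.pi) + 0.5 * math.log(det) + 0.5 * maha


# The same localization error is penalized less when the model
# reports a larger uncertainty, and an exact estimate is penalized
# more when the model is needlessly uncertain.
truth = (2.0, 0.0)
estimate = (0.0, 0.0)
confident = ((1.0, 0.0), (0.0, 1.0))
uncertain = ((4.0, 0.0), (0.0, 4.0))
nll_confident = gaussian_nll_2d(truth, estimate, confident)
nll_uncertain = gaussian_nll_2d(truth, estimate, uncertain)
```

In this sketch the covariance acts as the uncertainty estimate: a point estimate alone cannot distinguish a confident prediction from a hesitant one, whereas the likelihood rewards covariances that match the actual error.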
