Paper information - PILOT: Introducing Transformers for Probabilistic Sound Event Localization

Sound event localization aims to estimate the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain have focused most prominently on deep recurrent neural networks. Inspired by the success of transformer architectures as an alternative to classical recurrent neural networks, this paper introduces a novel transformer-based sound event localization framework in which temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms. Additionally, the estimated sound event positions are represented as multivariate Gaussian variables, which yields a notion of uncertainty that many previously proposed deep learning-based systems for this application do not provide. The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy. It outperforms all competing systems on all datasets, with statistically significant differences in performance.
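
To illustrate what representing a position estimate as a multivariate Gaussian variable can mean in practice, the following is a minimal sketch, not the authors' implementation. It assumes a 2-D position (for example azimuth and elevation in radians) and scores a predicted mean and covariance with the standard Gaussian negative log-likelihood; the abstract does not state the loss actually used, and the function name `gaussian_nll` and all numeric values are hypothetical.

```python
import numpy as np


def gaussian_nll(target, mean, cov):
    """Negative log-likelihood of `target` under N(mean, cov).

    Assumption: this is the textbook multivariate Gaussian NLL,
    not necessarily the training objective used in the paper.
    """
    target = np.asarray(target, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.shape[0]
    diff = target - mean
    # slogdet is numerically safer than log(det(cov)).
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    # Squared Mahalanobis distance, computed without an explicit inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    return 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


# Hypothetical values: (azimuth, elevation) in radians.
true_pos = np.array([0.50, 0.10])
pred_mean = np.array([0.45, 0.12])
confident = np.diag([0.01, 0.01])  # small predicted variance
uncertain = np.diag([1.0, 1.0])    # large predicted variance

print(gaussian_nll(true_pos, pred_mean, confident))
print(gaussian_nll(true_pos, pred_mean, uncertain))
```

When the predicted mean is close to the true position, the tighter covariance receives the lower negative log-likelihood; a tight covariance around a badly wrong mean would be penalized heavily instead. This is the sense in which the covariance acts as a calibrated uncertainty estimate.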
