How to (virtually) train your sound source localizer

Learning-based methods have become ubiquitous in sound source localization (SSL). Because sufficiently large, diverse and annotated real datasets are lacking, existing systems rely on simulated training sets. Most room acoustic simulators used for this purpose rely on the image source method (ISM) because of its computational efficiency. This paper argues that carefully extending the ISM to incorporate more realistic surface, source and microphone responses into training sets can significantly boost the real-world performance of SSL systems. Increasing the training-set realism of a state-of-the-art direction-of-arrival estimator is shown to yield consistent improvements across three different real test sets featuring human speakers in a variety of rooms and with various microphone arrays. An ablation study further reveals that every added layer of realism contributes positively to these improvements.
