Characterization of Moving Sound Sources Direction-of-Arrival Estimation Using Different Deep Learning Architectures

Sound source localization is an important task for several applications, and the use of deep learning for this task has recently become a popular research topic. While a number of previous works have focused on static sound sources, in this article we evaluate the performance of a deep learning classification system for the localization of moving sound sources. In particular, we evaluate the effect of key parameters at the levels of feature extraction (e.g., short-time Fourier transform (STFT) parameters) and model training (e.g., neural network (NN) architectures). We evaluate the performance of the different settings in terms of precision and F-score, in a multiclass multilabel classification framework. In our previous work on the localization of moving sound sources, we investigated feedforward NNs (FNNs) under different acoustic conditions and STFT parameters and showed that the presence of some reverberation in the training dataset can improve detection of the direction of arrival of the sources. In this article, we extend that work to show that (i) the window size does not affect performance for static sources but strongly affects performance for moving sources, (ii) the sequence length has a significant effect on the performance of recurrent architectures, and (iii) a temporal convolutional NN can outperform both recurrent and feedforward networks for moving sound sources.
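As an illustration of the evaluation metrics named above, the following minimal sketch computes precision, recall, and F-score for a multilabel direction-of-arrival classifier, where each time frame can activate several direction classes at once. The micro-averaging scheme, the function name `multilabel_precision_f1`, and the toy label matrices are assumptions made for this example; the abstract does not state which averaging the authors use.

```python
import numpy as np


def multilabel_precision_f1(y_true, y_pred):
    """Micro-averaged precision, recall, and F-score.

    y_true, y_pred: binary indicator arrays of shape (n_frames, n_classes),
    where a 1 in column k means a source is active in direction class k.
    Micro-averaging (pooling counts over all frames and classes) is an
    assumption of this sketch, not a detail taken from the article.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    tp = np.sum(y_true & y_pred)    # active directions correctly detected
    fp = np.sum(~y_true & y_pred)   # directions detected but not active
    fn = np.sum(y_true & ~y_pred)   # active directions that were missed

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0
    return float(precision), float(recall), float(f_score)


# Hypothetical toy data: 4 frames, 6 direction classes, 2 sources per frame.
y_true = np.array([
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
y_pred = np.array([
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],   # one source assigned to the wrong direction
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],   # one source missed
])

precision, recall, f_score = multilabel_precision_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} F-score={f_score:.3f}")
```

In this toy case a moving source that the classifier assigns to a neighbouring direction class counts as both a false positive and a false negative, which is one reason frame-level metrics of this kind are sensitive to source motion.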
