Robust Interaural Time Difference Estimation Based on Convolutional Neural Network

This paper proposes a novel cross-correlation function (CCF) extraction method based on a convolutional neural network (CNN) for time difference of arrival (TDOA) estimation and, by extension, direction of arrival (DOA) estimation. The CNN learns the relationship between the cross-correlation localization features and the pre-processed waveform signal, which may contain not only the source signal but also background noise and reverberation. In contrast to many previous sound source localization approaches, the proposed method focuses on spatial feature extraction. Two kinds of output, grouped CCF and encoded CCF, are designed to capture the implicit tendency of the location information. Experimental results demonstrate that the proposed method outperforms conventional TDOA estimation methods in environments with different levels of noise and reverberation.
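For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of conventional cross-correlation-based TDOA estimation, that is, the kind of baseline the abstract compares against and not the proposed CNN method. The function names, the toy signal, the 16 kHz sample rate, and the lag range are all illustrative assumptions and do not come from the paper.

```python
import random


def cross_correlation(left, right, max_lag):
    """Return {lag: value} with value = sum_n left[n] * right[n + lag].

    A peak at a positive lag means `right` is a delayed copy of `left`.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    n = min(len(left), len(right))
    ccf = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:  # skip samples that fall outside the signal
                total += left[i] * right[j]
        ccf[lag] = total
    return ccf


def estimate_tdoa(left, right, sample_rate, max_lag):
    """Pick the lag with the largest CCF value and convert it to seconds."""
    ccf = cross_correlation(left, right, max_lag)
    best_lag = max(ccf, key=ccf.get)
    return best_lag, best_lag / sample_rate


# Toy two-channel signal: the right channel is the left channel delayed
# by 3 samples (noise-free and anechoic, purely an assumption).
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(256)]
true_delay = 3
left_channel = source
right_channel = [0.0] * true_delay + source[:-true_delay]

best_lag, tdoa_seconds = estimate_tdoa(
    left_channel, right_channel, sample_rate=16000, max_lag=8
)
print(best_lag, tdoa_seconds)
```

In this clean toy case the CCF peak sits at the true delay. Under noise and reverberation the peak of a plain CCF can shift or flatten, which is the failure mode the abstract says the learned CCF is meant to address.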
