Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking

The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference and noise, so TDoA estimates fluctuate between the speech source and the interference. Deep Neural Networks (DNNs) have been applied to Time-Frequency (TF) masking for Acoustic Source Localization (ASL), filtering non-speech components out of a speaker location likelihood function. However, the most suitable type of TF mask for this task is not obvious. Moreover, the quantity of interest is the TDoA itself, yet existing solutions train the DNN to estimate the TF mask instead. To overcome these issues, a direct formulation of TF masking as part of a DNN-based ASL structure is proposed. The proposed network operates online, producing estimates frame by frame, and its recurrent layers exploit the sequential progression of speaker-related TDoAs. Training with different microphone spacings allows the model to be reused for different microphone pair geometries at inference time. Real-data experiments with smartphone recordings of speech in interference demonstrate the network's generalization capability.
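To make the TF-masking idea concrete, below is a minimal sketch (not the authors' implementation) of TF-masked GCC-PHAT TDoA estimation for a single STFT frame of one microphone pair. The function name masked_gcc_phat, the 16 kHz sampling rate, and the placeholder all-ones mask are illustrative assumptions; in the proposed method the masking is integrated into the DNN, which outputs the TDoA directly, rather than being applied as a separate post-processing step as shown here.

```python
# Minimal sketch (assumed, not the authors' code): TF-masked GCC-PHAT
# TDoA estimation for a single STFT frame of one microphone pair.
# The mask `m` stands in for a DNN-estimated speech mask; here it is a
# placeholder so the example stays self-contained.
import numpy as np

def masked_gcc_phat(x1, x2, m, fs, max_tdoa):
    """Return the TDoA (seconds) of x2 relative to x1 for one frame.

    x1, x2   : complex one-sided STFT vectors of the two channels
    m        : TF mask in [0, 1], same length as x1
    fs       : sampling rate in Hz
    max_tdoa : physically feasible delay limit (mic spacing / speed of sound)
    """
    n_fft = 2 * (len(x1) - 1)                 # even frame length assumed
    cross = np.conj(x1) * x2                  # cross power spectrum
    phat = cross / (np.abs(cross) + 1e-12)    # PHAT weighting keeps phase only
    cc = np.fft.irfft(m * phat, n=n_fft)      # masked cross-correlation
    cc = np.concatenate((cc[-n_fft // 2:], cc[:n_fft // 2]))  # centre zero lag
    lags = np.arange(-n_fft // 2, n_fft // 2) / fs
    valid = np.abs(lags) <= max_tdoa          # restrict search to feasible lags
    return lags[valid][np.argmax(cc[valid])]

# Usage with synthetic data: the second channel is the first delayed by 0.2 ms.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(513) + 1j * rng.standard_normal(513)
freqs = np.fft.rfftfreq(1024, d=1.0 / 16000)
x2 = x1 * np.exp(-2j * np.pi * freqs * 2e-4)  # apply a 0.2 ms delay in frequency
mask = np.ones(513)                           # all-ones mask = plain GCC-PHAT
print(masked_gcc_phat(x1, x2, mask, fs=16000, max_tdoa=5e-4))  # ~2e-4 s
```

With an all-ones mask the sketch reduces to plain GCC-PHAT; a speech-dominated mask would instead down-weight interference-dominated TF bins before the correlation peak is picked, which is the effect the integrated masking aims to achieve inside the network.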
