Weighted Spatial Covariance Matrix Estimation for MUSIC Based TDOA Estimation of Speech Source

We study the estimation of the time difference of arrival (TDOA) under noisy and reverberant conditions. Conventional TDOA estimation methods such as MUltiple SIgnal Classification (MUSIC) are not robust to noise and reverberation because of the distortion they introduce in the spatial covariance matrix (SCM). To address this issue, this paper proposes a robust SCM estimation method, called the weighted SCM (WSCM). In WSCM estimation, each time-frequency (TF) bin of the input signal is weighted by a TF mask, which in the ideal case is 0 for non-speech TF bins and 1 for speech TF bins. In practice, the TF mask takes values between 0 and 1 that are predicted by a long short-term memory (LSTM) network trained on a large amount of simulated noisy and reverberant data. The mask weights greatly reduce the contribution of low-SNR TF bins to the SCM estimate and hence improve the robustness of MUSIC. Experimental results on both simulated and real data show that the proposed WSCM significantly improves the robustness of MUSIC.
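As a minimal sketch of the idea (not the authors' implementation), the Python snippet below computes a mask-weighted SCM per frequency bin from an STFT of a two-microphone recording and evaluates a MUSIC-style pseudo-spectrum over candidate TDOAs. The array names (`stft_signal`, `mask`, `tdoa_grid`) are hypothetical; in the proposed method the mask would come from the LSTM described above.

```python
import numpy as np

def weighted_scm(stft_signal, mask, eps=1e-8):
    """Mask-weighted spatial covariance matrix per frequency bin.

    stft_signal: complex array, shape (mics, freqs, frames)
    mask:        real array, shape (freqs, frames), values in [0, 1]
                 (ideally 1 for speech-dominated TF bins, 0 otherwise)
    returns:     complex array, shape (freqs, mics, mics)
    """
    M, F, T = stft_signal.shape
    scm = np.zeros((F, M, M), dtype=complex)
    for f in range(F):
        X = stft_signal[:, f, :]                 # (mics, frames)
        w = mask[f, :]                           # (frames,)
        # Weighted average of outer products: low-SNR bins contribute little.
        scm[f] = (X * w) @ X.conj().T / (w.sum() + eps)
    return scm

def music_tdoa_spectrum(scm, freqs, tdoa_grid, n_sources=1):
    """MUSIC pseudo-spectrum over candidate TDOAs for a two-mic pair."""
    spectrum = np.zeros(len(tdoa_grid))
    for f, R in zip(freqs, scm):
        # Noise subspace from the eigen-decomposition of the (weighted) SCM;
        # eigh returns eigenvalues in ascending order.
        _, vecs = np.linalg.eigh(R)
        En = vecs[:, : R.shape[0] - n_sources]   # noise eigenvectors
        for i, tau in enumerate(tdoa_grid):
            # Steering vector for a delay of tau seconds between the two mics.
            a = np.array([1.0, np.exp(-2j * np.pi * f * tau)])
            denom = np.abs(a.conj() @ En @ En.conj().T @ a)
            spectrum[i] += 1.0 / (denom + 1e-12)
    return spectrum

# The TDOA estimate is the grid value maximizing the pooled pseudo-spectrum:
# tau_hat = tdoa_grid[np.argmax(music_tdoa_spectrum(scm, freqs, tdoa_grid))]
```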
