Robust Speaker Localization Guided by Deep Learning-Based Time-Frequency Masking

Deep learning-based time-frequency (T-F) masking has dramatically advanced monaural (single-channel) speech separation and enhancement. This study investigates its potential for direction of arrival (DOA) estimation in noisy and reverberant environments. We explore ways of combining T-F masking with conventional localization algorithms, such as generalized cross-correlation with phase transform (GCC-PHAT), as well as with newly proposed algorithms based on steered-response SNR and steering vectors. The key idea is to use deep neural networks (DNNs) to identify speech-dominant T-F units, whose phase is relatively uncorrupted and therefore reliable for DOA estimation. Because the DNN is trained on monaural spectral information alone, the trained model is directly applicable to arrays with any number of microphones arranged in diverse geometries. Although only monaural information is used for training, experimental results show that the proposed approach remains strongly robust in unseen environments with intense noise and room reverberation, outperforming traditional DOA estimation methods by large margins. Our study also suggests that the ideal ratio mask and its variants remain effective training targets for robust speaker localization.
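
To make the key idea concrete, below is a minimal sketch of mask-weighted GCC-PHAT for one microphone pair. It assumes a monaural DNN has already produced a per-T-F speech-presence mask (e.g., an estimated ratio mask) aligned with the STFT of the two channels; the function name `masked_gcc_phat` and its parameters are illustrative choices, not the exact formulation used in the paper.

```python
import numpy as np
from scipy.signal import stft


def masked_gcc_phat(x1, x2, mask, fs, n_fft=512, hop=256, max_tau=None):
    """Mask-weighted GCC-PHAT TDOA estimate for one microphone pair.

    mask: (freq, time) array of speech-presence weights in [0, 1],
    e.g. a DNN-estimated ratio mask (assumed, for illustration, to be
    computed monaurally and shared across the pair).
    """
    _, _, X1 = stft(x1, fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, X2 = stft(x2, fs, nperseg=n_fft, noverlap=n_fft - hop)

    cross = X1 * np.conj(X2)
    phat = cross / (np.abs(cross) + 1e-8)   # phase transform: keep phase only
    pooled = (mask * phat).sum(axis=1)      # emphasize speech-dominant units,
                                            # pool across frames

    # Inverse FFT of the pooled cross-spectrum yields the correlation over lags.
    r = np.fft.irfft(pooled, n=n_fft)
    r = np.concatenate((r[-n_fft // 2:], r[:n_fft // 2]))  # center zero lag
    lags = np.arange(-n_fft // 2, n_fft // 2) / fs
    if max_tau is not None:                 # physical limit: spacing / sound speed
        keep = np.abs(lags) <= max_tau
        r, lags = r[keep], lags[keep]
    return lags[np.argmax(r)]               # TDOA in seconds
```

For a pair with spacing d, the estimated TDOA tau maps to a DOA via theta = arcsin(c * tau / d), with c about 343 m/s. With mask set to all ones, the function reduces to standard frame-pooled GCC-PHAT, which is a useful sanity check.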
