Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition

We investigate the impact of speaker localization errors on the performance of speech recognition systems in a multispeaker, multichannel environment. Given the speaker location information, speech separation is performed in three stages. In the first stage, a simple delay-and-sum (DS) beamformer enhances the signal impinging from the speaker location. In the second stage, the enhanced signal is used by a neural network to estimate a time-frequency mask corresponding to the localized speaker. In the third stage, this mask is used to compute second-order statistics and to derive an adaptive beamformer. We generate a multichannel, multispeaker, reverberated, noisy dataset inspired by the well-studied WSJ0-2mix corpus and study the performance of the proposed pipeline in terms of word error rate (WER). An average WER of 29.4% was achieved using the ground-truth localization information and 42.4% using the localization information estimated via GCC-PHAT. Although a higher signal-to-interference ratio (SIR) between the speakers was found to positively impact the speech separation performance, comparable performance was obtained for mixtures with lower SIR values when the speakers were well separated in space.
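The three-stage pipeline lends itself to a compact illustration. The following is a minimal NumPy sketch, not the authors' implementation: it assumes the speaker mask is already available (in the paper it is produced by the neural network in the second stage), uses GCC-PHAT for the TDOA estimates that steer the delay-and-sum stage, and realizes the adaptive beamformer as a mask-based MVDR (Souden formulation) built from the second-order statistics. All function names, array shapes, and the reference-channel convention are illustrative assumptions.

```python
import numpy as np

def gcc_phat(x_ref, x_m, fs, max_tau=None):
    """Estimate the TDOA of channel x_m relative to the reference via GCC-PHAT."""
    n = len(x_ref) + len(x_m)
    X = np.fft.rfft(x_ref, n=n)
    Y = np.fft.rfft(x_m, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12                 # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_and_sum(stft, freqs, tdoas):
    """Stage 1: phase-align all channels towards the speaker location and average.

    stft  : (M, F, T) complex multichannel STFT
    freqs : (F,) frequency-bin centres in Hz
    tdoas : (M,) per-channel TDOAs in seconds (0 for the reference channel)
    """
    steer = np.exp(2j * np.pi * freqs[None, :] * np.asarray(tdoas)[:, None])
    return np.mean(steer[:, :, None] * stft, axis=0)           # (F, T)

def mask_based_mvdr(stft, mask, ref_ch=0):
    """Stages 2-3: turn the speaker mask into second-order statistics and
    derive an adaptive (MVDR) beamformer per frequency bin.

    stft : (M, F, T) complex multichannel STFT
    mask : (F, T) speech mask in [0, 1], here assumed given by the network
    """
    M, F, T = stft.shape
    out = np.zeros((F, T), dtype=complex)
    u = np.zeros(M)
    u[ref_ch] = 1.0                                            # reference-channel selector
    for f in range(F):
        X = stft[:, f, :]                                      # (M, T)
        w_s, w_n = mask[f], 1.0 - mask[f]
        phi_s = (w_s * X) @ X.conj().T / (w_s.sum() + 1e-12)   # speech covariance
        phi_n = (w_n * X) @ X.conj().T / (w_n.sum() + 1e-12)   # noise + interference covariance
        phi_n += 1e-6 * np.trace(phi_n).real / M * np.eye(M)   # diagonal loading
        num = np.linalg.solve(phi_n, phi_s)                    # Phi_nn^{-1} Phi_ss
        w = (num @ u) / (np.trace(num) + 1e-12)                # MVDR weights (Souden formulation)
        out[f] = w.conj() @ X                                  # beamformed output for this bin
    return out
```

Note that the localization estimate enters the pipeline only through the steering of the first-stage DS beamformer, whose output conditions the mask network; any localization error therefore propagates through the mask into the second-order statistics and the adaptive beamformer, which is the effect the WER comparison above quantifies.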
