On Spatial Features for Supervised Speech Separation and its Application to Beamforming and Robust ASR

This study integrates complementary spectral and spatial information to improve deep learning-based time-frequency masking and acoustic beamforming. Coherence and directional features are designed as additional inputs for deep neural network training, suppressing diffuse noise and directional interference pervasive in real-world recordings. These features are designed to be relatively invariant to the target direction, the number of microphones, and the microphone geometry. The estimated masks are then used to compute steering vectors and spatial covariance matrices for beamforming and robust ASR. Experiments on the CHiME-4 dataset demonstrate the effectiveness of the proposed approach.
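The mask-based beamforming step described above can be sketched as follows. In each frequency bin, the estimated time-frequency mask weights the outer products of the multichannel STFT frames to form speech and noise spatial covariance matrices; a steering vector is taken as the principal eigenvector of the speech covariance, and MVDR weights are derived from it and the noise covariance. This is a minimal illustrative sketch, not the paper's implementation; all function and variable names here are assumptions.

```python
import numpy as np

def mask_based_mvdr(Y, mask):
    """Illustrative mask-based MVDR beamformer (names are hypothetical).

    Y    : (F, T, M) complex multichannel STFT (freq, frame, mic).
    mask : (F, T) estimated speech mask in [0, 1].
    Returns the (F, T) beamformed STFT.
    """
    F, T, M = Y.shape
    X = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[f]                       # (T, M) frames for this bin
        m = mask[f][:, None]            # (T, 1) speech mask

        # Mask-weighted spatial covariance matrices:
        # Phi[i, j] = sum_t w(t) * y_i(t) * conj(y_j(t))
        Phi_s = (m * Yf).T @ Yf.conj() / np.maximum(m.sum(), 1e-6)
        Phi_n = ((1 - m) * Yf).T @ Yf.conj() / np.maximum((1 - m).sum(), 1e-6)

        # Steering vector: principal eigenvector of the speech covariance.
        _, vecs = np.linalg.eigh(Phi_s)
        d = vecs[:, -1]

        # Diagonal loading keeps the noise covariance invertible.
        Phi_n = Phi_n + 1e-6 * np.trace(Phi_n).real / M * np.eye(M)

        # MVDR weights: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d).
        num = np.linalg.solve(Phi_n, d)
        w = num / (d.conj() @ num)

        # Apply the beamformer: X(f, t) = w^H y(f, t).
        X[f] = Yf @ w.conj()
    return X
```

The per-bin loop keeps the sketch readable; a practical implementation would batch the covariance and eigenvector computations across frequency bins.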
