Multi-channel speech processing architectures for noise robust speech recognition: 3rd CHiME challenge results

Recognizing speech under noisy conditions is an ill-posed problem. The CHiME 3 challenge targets robust speech recognition in realistic environments such as streets, buses, cafés, and pedestrian areas. We study variants of beamformers used for pre-processing multi-channel speech recordings. In particular, we investigate three variants of generalized side-lobe canceller (GSC) beamformers: GSC with a sparse blocking matrix (BM), GSC with an adaptive BM (ABM), and GSC with minimum variance distortionless response (MVDR) and ABM. Furthermore, we apply several post-filters to further enhance the speech signal. We introduce MaxPower postfilters and deep neural postfilters (DPFs). DPFs significantly outperformed our baseline systems in terms of the overall perceptual score (OPS) and the perceptual evaluation of speech quality (PESQ). In particular, DPFs achieved an average relative improvement of 17.54% in OPS and 18.28% in PESQ compared to the CHiME 3 baseline. DPFs also achieved the best word error rate (WER) when combined with an automatic speech recognition (ASR) engine on the simulated development and evaluation data, namely 8.98% and 10.82%, respectively. The proposed MaxPower beamformer achieved the best overall WER on the CHiME 3 real development and evaluation data, namely 14.23% and 22.12%, respectively.
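As an illustration of the MVDR criterion mentioned above, the following is a minimal, dependency-free sketch of the textbook narrowband MVDR weight computation, w = R⁻¹d / (dᴴR⁻¹d), for a single frequency bin. It is not the implementation used in this work: the three-microphone steering vector `d`, the noise covariance `R`, and the helper names `solve`, `mvdr_weights`, and `apply_beamformer` are all hypothetical and chosen only for illustration.

```python
import cmath


def solve(A, b):
    """Solve A x = b for a small complex system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    # Build the augmented matrix [A | b] so row operations act on both sides.
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0j] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def mvdr_weights(R, d):
    """Textbook MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin."""
    Rinv_d = solve(R, d)
    denom = sum(di.conjugate() * ri for di, ri in zip(d, Rinv_d))
    return [ri / denom for ri in Rinv_d]


def apply_beamformer(w, x):
    """Beamformer output y = w^H x for one multi-channel observation x."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))


# Hypothetical example: 3 microphones, assumed phase shift of 0.4 rad per microphone.
d = [cmath.exp(-1j * 0.4 * m) for m in range(3)]

# Assumed Hermitian, positive-definite noise covariance matrix (illustrative values).
R = [
    [2.0 + 0j, 0.3 + 0.1j, 0.1 + 0j],
    [0.3 - 0.1j, 1.5 + 0j, 0.2 + 0.05j],
    [0.1 + 0j, 0.2 - 0.05j, 1.0 + 0j],
]

w = mvdr_weights(R, d)

# Response of the beamformer in the assumed look direction.
gain = apply_beamformer(w, d)
print(abs(gain))
```

The weights minimize the output noise power subject to the distortionless constraint wᴴd = 1, so a signal arriving from the assumed look direction passes with unit gain. In practice both `R` and `d` would have to be estimated from the recordings for each frequency bin.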
