The submitted system for the CHiME-5 challenge focuses on building a better front-end for an automatic speech recognition (ASR) system trained on the data provided by CHiME-5. In this work, we focus on non-negative matrix factorization (NMF) based techniques for denoising and dereverberation. In Approach 1, the degraded single-channel speech utterances were enhanced using multi-channel weighted prediction error (WPE) processing or NMF, followed by a minimum variance distortionless response (MVDR) beamformer to obtain an enhanced signal. In Approach 2, we used a multi-channel MVDR beamformer followed by an NMF-based single-channel enhancement. With the baseline acoustic model (AM), these enhanced speech utterances did not improve the word error rate (WER) over the baseline BeamformIt-based system. Hence, we retrained the AM on WPE-enhanced training data (Approach 3). These approaches improved the ASR results compared to the baseline. We submit results for the single-array track and focus only on acoustic robustness (i.e., ranking A).

1. Degradation model

The CHiME-5 corpus consists of conversational speech recorded in a dinner party scenario [1]. Four participants were present at each dinner party, and the speakers were asked to hold normal conversations. The speech was recorded using six Kinect microphone arrays placed at different locations in the room. Each dinner party lasted at least 1.5 hours. The recordings are degraded by non-stationary noise, reverberation, overlapping speakers and speaker movements.

1.1. Model for reverberation and noise

In the proposed framework, it is assumed that only one speaker is active at any time. Further, the clean speech is assumed to be degraded by reverberation and noise; other degradations, such as interfering speakers and speaker movements, are not handled. Reverberation in the time domain is modeled as the convolution of the source signal with the room impulse response (RIR), and noise is assumed to be additive to the reverberant speech. The time-domain speech recorded at the i-th microphone, y_i(n), is written as

    y_i(n) = y_i^R(n) + z_i(n) = s(n) ∗ h_i(n) + z_i(n),        (1)

where s(n) is the clean utterance, and y_i^R(n), h_i(n) and z_i(n) are the reverberated speech, the RIR and the noise at the i-th microphone, respectively.

The proposed NMF enhancements are based on the magnitude spectrogram model for degraded speech in [2]. The NMF enhancement can be applied to any single channel of the microphone array recording or to the output of a multi-channel enhancement method. The input degraded spectrogram Y ∈ R^{K×T} is modeled using NMF. Such a model is obtained by combining NMF models for the clean speech and noise spectrograms with a separability assumption on the RIR spectrogram, H(k, n) = H_1(k) H_2(n). The NMF models for the clean speech spectrogram S and the noise spectrogram Z are given in (2):

    S = W_s X_s,    Z = W_n X_n,        (2)

where W_s and W_n are the bases for the clean speech and noise spectrograms, respectively, and X_s and X_n are the corresponding activations. Using the separability assumption on the RIR spectrogram together with the NMF models in (2), the degraded speech spectrogram can be expressed in terms of the factors H_1, H_2, W_s, X_s, W_n and X_n.
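To make the model concrete, the following Python sketch synthesizes a degraded signal according to (1), computes its magnitude spectrogram, and fits an NMF-plus-separable-RIR model built from (2) with generic multiplicative updates. The synthetic RIR, the noise level, the number of bases, the squared-error cost and the update rules are illustrative assumptions and not the exact configuration of the submitted system; in the actual method the bases W_s and W_n would typically be trained offline.

```python
import numpy as np
from scipy.signal import stft, fftconvolve

rng = np.random.default_rng(0)

# ---------------------------------------------------------------------------
# Illustrative sketch only. Signal lengths, the synthetic RIR, the number of
# NMF bases, the squared-error cost and the multiplicative updates are
# assumptions made for this example, not the recipe of the submitted system.
# ---------------------------------------------------------------------------

# Time-domain degradation model of eq. (1): y_i(n) = s(n) * h_i(n) + z_i(n)
fs = 16000
s = rng.standard_normal(2 * fs)                   # stand-in for a clean utterance
h = np.exp(-np.arange(0.5 * fs) / (0.05 * fs))    # crude exponentially decaying "RIR"
h *= rng.standard_normal(h.size)
z = 0.05 * rng.standard_normal(s.size + h.size - 1)   # additive noise
y = fftconvolve(s, h) + z                         # degraded microphone signal

# Magnitude spectrogram of the degraded signal, Y in R^{K x T}
_, _, Y_stft = stft(y, fs=fs, nperseg=512)
Y = np.abs(Y_stft)
K, T = Y.shape

# NMF models of eq. (2): S = W_s X_s, Z = W_n X_n. In practice W_s and W_n
# would be learned offline from clean-speech and noise data; random
# non-negative placeholders are used here.
R_s, R_n = 40, 10
W_s = np.abs(rng.standard_normal((K, R_s)))
X_s = np.abs(rng.standard_normal((R_s, T)))
W_n = np.abs(rng.standard_normal((K, R_n)))
X_n = np.abs(rng.standard_normal((R_n, T)))

# Separable RIR spectrogram H(k, n) = H_1(k) H_2(n)
H1 = np.ones((K, 1))
H2 = np.ones((1, T))

def degraded_model(W_s, X_s, W_n, X_n, H1, H2):
    """Assumed model: separable RIR term applied element-wise to the
    clean-speech NMF model, plus the additive noise NMF model."""
    return (H1 * H2) * (W_s @ X_s) + W_n @ X_n

# Multiplicative updates for a squared-error cost ||Y - model||^2
# (standard NMF-style updates, keeping the pretrained bases fixed).
eps = 1e-12
for _ in range(200):
    Y_hat = degraded_model(W_s, X_s, W_n, X_n, H1, H2) + eps
    X_s *= (W_s.T @ ((H1 * H2) * Y)) / (W_s.T @ ((H1 * H2) * Y_hat) + eps)
    X_n *= (W_n.T @ Y) / (W_n.T @ Y_hat + eps)
    Y_hat = degraded_model(W_s, X_s, W_n, X_n, H1, H2) + eps
    H1 *= ((Y * (W_s @ X_s) * H2).sum(1, keepdims=True)
           / ((Y_hat * (W_s @ X_s) * H2).sum(1, keepdims=True) + eps))
    Y_hat = degraded_model(W_s, X_s, W_n, X_n, H1, H2) + eps
    H2 *= ((Y * (W_s @ X_s) * H1).sum(0, keepdims=True)
           / ((Y_hat * (W_s @ X_s) * H1).sum(0, keepdims=True) + eps))

# The clean-speech term serves as the enhanced magnitude estimate, which can
# be paired with the degraded phase for resynthesis or feature extraction.
S_hat = W_s @ X_s
```

Keeping the bases fixed during enhancement mirrors the usual NMF setup, where speech and noise dictionaries are learned offline and only the activations and the RIR factors are estimated from the degraded utterance.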
References

[1] Bhiksha Raj, et al., "Speech denoising using nonnegative matrix factorization with priors," 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008.
[2] Jon Barker, et al., "The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines," INTERSPEECH, 2018.
[3] Preeti Rao, et al., "A Non-convolutive NMF Model for Speech Dereverberation," INTERSPEECH, 2018.
[4] Xavier Anguera Miró, et al., "Acoustic Beamforming for Speaker Diarization of Meetings," IEEE Transactions on Audio, Speech, and Language Processing, 2007.
[5] Biing-Hwang Juang, et al., "Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction," IEEE Transactions on Audio, Speech, and Language Processing, 2010.
[6] Emmanuel Vincent, et al., "A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.