Deep Beamforming and Data Augmentation for Robust Speech Recognition: Results of the 4th CHiME Challenge

Robust automatic speech recognition in adverse environments is a challenging task. We address the multi-channel tracks of the 4th CHiME challenge [1] by proposing a deep eigenvector beamformer as the front-end. To train the acoustic models, we supplement the beamformed data with the noisy audio streams of the individual microphones provided in the real set. Furthermore, we perform data augmentation by modulating the amplitude and time-scale of the audio. Our proposed system achieves a word error rate of 4.22% on the real development data and 8.98% on the real evaluation data for the 6-channel track, and 6.45% and 13.69%, respectively, for the 2-channel track.
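
The abstract does not say how the amplitude and time-scale modulation is implemented, so the following is only a minimal sketch under stated assumptions. It scales the waveform by a gain factor and changes its duration by linear-interpolation resampling, which is an assumed stand-in (it also shifts pitch, as in speed perturbation) rather than the authors' method. The function names `augment` and `random_augment` and the gain and rate ranges are hypothetical and chosen for illustration.

```python
import numpy as np


def augment(signal, gain, rate):
    """Scale the amplitude by `gain` and the time axis by `rate`.

    rate > 1 shortens the signal, rate < 1 lengthens it. The resampling is
    plain linear interpolation (an assumption), so pitch changes with duration.
    Expects a non-empty 1-D signal.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_out = max(1, int(round(len(signal) / rate)))
    # Positions in the original signal at which the output is sampled.
    src_pos = np.linspace(0.0, len(signal) - 1, num=n_out)
    stretched = np.interp(src_pos, np.arange(len(signal)), signal)
    return gain * stretched


def random_augment(signal, rng, gain_range=(0.5, 2.0), rate_range=(0.9, 1.1)):
    """Draw a random gain and rate (hypothetical ranges) and apply them."""
    gain = rng.uniform(*gain_range)
    rate = rng.uniform(*rate_range)
    return augment(signal, gain, rate)
```

Under these assumptions, each training utterance could be passed through `random_augment` with a seeded `np.random.default_rng` to produce additional copies that differ in level and speaking rate.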

[1] Ehud Weinstein, et al. Signal enhancement using beamforming and nonstationarity with applications to speech, 2001, IEEE Trans. Signal Process.

[2] Jacob Benesty, et al. Springer handbook of speech processing, 2007, Springer Handbooks.

[3] Franz Pernkopf, et al. Representation Learning for Single-Channel Source Separation and Bandwidth Extension, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[4] Franz Pernkopf, et al. DNN-based speech mask estimation for eigenvector beamforming, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Walter Kellermann, et al. Analysis of blocking matrices for generalized sidelobe cancellers for non-stationary broadband signals, 2002, ICASSP.

[6] Akihiko Sugiyama, et al. A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters, 1999, IEEE Trans. Signal Process.

[7] Jingdong Chen, et al. Acoustic MIMO Signal Processing, 2006.

[8] S. Gannot, et al. Speech enhancement based on the general transfer function GSC and postfiltering, 2004, IEEE Trans. Speech Audio Process.

[9] Jingdong Chen, et al. Microphone Array Signal Processing, 2008.

[10] Reinhold Häb-Umbach, et al. Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[11] Daniel Povey, et al. The Kaldi Speech Recognition Toolkit, 2011.

[12] Jon Barker, et al. An analysis of environment, microphone and data simulation mismatches in robust speech recognition, 2017, Comput. Speech Lang.