Investigation into Joint Optimization of Single Channel Speech Enhancement and Acoustic Modeling for Robust ASR

This paper investigates the joint optimization of single-channel speech enhancement and the acoustic model of a hybrid DNN-HMM system for noise-robust ASR. Two enhancement methods are investigated: masking of the noisy speech signal with a speech mask estimated by a DNN-based mask estimator, and a parametric Wiener filter that employs a DNN-based noise estimator together with a DNN-based frame-wise estimation of the filter parameters. These components are jointly optimized with the acoustic model of the ASR system. It is shown that the Wiener filter approach can improve the performance of a state-of-the-art single-channel ASR system on the single-channel track of the CHiME-4 data, where the WER on the real evaluation set is reduced from 11.6 % to 10.5 %.
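
The two enhancement front-ends can be illustrated with a minimal NumPy sketch. The abstract does not state the exact form of the parametric Wiener filter, so the gain used here, (S / (S + alpha * N)) ** beta with S the estimated speech power and N the estimated noise power, is an assumption based on a commonly used parametrization. The function names, the toy spectrogram, and the constant stand-ins for the DNN outputs (mask, noise estimate, frame-wise alpha and beta) are hypothetical and serve only to show the data flow.

```python
import numpy as np


def apply_mask(noisy_stft, speech_mask):
    """Masking front-end: scale each time-frequency bin by a mask in [0, 1]."""
    return speech_mask * noisy_stft


def parametric_wiener_gain(noisy_power, noise_power, alpha, beta, floor=1e-10):
    """Assumed gain form: (S / (S + alpha * N)) ** beta, with S = max(Y - N, 0).

    alpha and beta have shape (frames, 1), so they broadcast over frequency
    and act as frame-wise filter parameters.
    """
    speech_power = np.maximum(noisy_power - noise_power, 0.0)
    denom = np.maximum(speech_power + alpha * noise_power, floor)
    return (speech_power / denom) ** beta


# Toy noisy spectrogram: 4 frames, 5 frequency bins (hypothetical data).
rng = np.random.default_rng(0)
frames, bins = 4, 5
noisy_stft = (rng.standard_normal((frames, bins))
              + 1j * rng.standard_normal((frames, bins)))
noisy_power = np.abs(noisy_stft) ** 2

# Stand-ins for the network outputs; in the paper these come from DNNs.
noise_power = 0.5 * noisy_power        # hypothetical noise estimate
alpha = np.ones((frames, 1))           # hypothetical frame-wise parameter
beta = np.ones((frames, 1))            # hypothetical frame-wise parameter

gain = parametric_wiener_gain(noisy_power, noise_power, alpha, beta)
enhanced_wiener = gain * noisy_stft
enhanced_masked = apply_mask(noisy_stft, np.full((frames, bins), 0.8))
```

Both front-ends consist only of differentiable element-wise operations. Written in an automatic-differentiation framework instead of plain NumPy, gradients of the acoustic model loss could therefore flow back into the estimator networks, which is what joint optimization requires.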
