Efficient on-line acoustic environment estimation for FCDCN in a continuous speech recognition system

A number of cepstral de-noising algorithms perform quite well when trained and tested under similar acoustic environments, but degrade quickly under mismatched conditions. We present two key results that make these algorithms practical in real noise environments, with the ability to adapt to different acoustic environments over time. First, we show that the existing de-noising computations can be leveraged to estimate the acoustic environment on-line and in real time. Second, we show that it is not necessary to collect large amounts of training data in each environment: clean data with artificial mixing is sufficient. When this new method is used as a pre-processing stage to a large vocabulary speech recognition system, the system can be made robust to a wide variety of acoustic environments. With synthetic training data, we are able to reduce the word error rate by 27%.
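The "artificial mixing" mentioned above can be illustrated with a minimal sketch that adds a noise recording to clean speech at a chosen signal-to-noise ratio. This is not the authors' procedure: the function name `mix_at_snr`, the parameter `snr_db`, and the choice to loop the noise to the length of the speech are all assumptions made for illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; not taken from the paper.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if clean.size == 0 or noise.size == 0:
        raise ValueError("clean and noise must be non-empty")

    # Assumption: loop the noise recording, then trim it to the speech length.
    reps = int(np.ceil(clean.size / noise.size))
    noise = np.tile(noise, reps)[: clean.size]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    if clean_power == 0.0 or noise_power == 0.0:
        raise ValueError("clean and noise must have non-zero power")

    # SNR(dB) = 10 * log10(clean_power / scaled_noise_power)
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return clean + gain * noise
```

Applying such a function to a clean corpus with several noise recordings and SNR values would yield synthetic training data for multiple simulated environments, without recording speech in each one.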
