Multivariate-Gaussian-based speech compensation, or mapping, has been developed to reduce the mismatch between training and deployment conditions for robust speech recognition. The acoustic mapping procedure can be formulated as a feature-space adaptation in which a noisy input signal is transformed by a multivariate Gaussian network. We propose a novel algorithm that updates the network parameters by minimizing the Kullback-Leibler (KL) distance between the core recognizer's acoustic model and the distribution of the transformed features. It is designed to achieve optimal overall system performance rather than minimum mean square error (MMSE) in a specific feature domain. An online stochastic gradient descent learning rule is derived. We evaluate the new algorithm using a JRTk broadcast news system on a distant-talking speech corpus and compare its performance with that of previous MMSE-based approaches. The experiments show that the KL-based approach is more effective for a large vocabulary continuous speech recognition (LVCSR) system.
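To illustrate the flavor of the adaptation rule, the sketch below shows an online stochastic gradient descent update of a feature compensation transform driven by a Gaussian target from the acoustic model. It is a simplification under stated assumptions, not the paper's method: the transform is reduced to a single bias vector and the acoustic model to one multivariate Gaussian, so minimizing the data-dependent (cross-entropy) term of the KL distance reduces to maximizing the transformed features' likelihood under that Gaussian. The function name and parameters are illustrative only.

```python
import numpy as np

def online_kl_bias_adaptation(frames, mu, sigma, lr=0.01):
    """Illustrative sketch: adapt a bias b so that y = x + b scores well
    under a target Gaussian N(mu, sigma) taken from the acoustic model.
    frames: iterable of feature vectors (e.g. cepstra).
    Returns the adapted compensation bias b."""
    prec = np.linalg.inv(sigma)      # precision matrix of the target Gaussian
    b = np.zeros(mu.shape[0])        # compensation bias, updated online
    for x in frames:
        y = x + b                    # transformed (compensated) feature
        grad = prec @ (y - mu)       # gradient of -log N(y; mu, sigma) w.r.t. b
        b -= lr * grad               # stochastic gradient descent step
    return b

# Example: adapt on synthetic mismatched frames whose mean is shifted
# away from the model mean; the learned bias should roughly undo the shift.
rng = np.random.default_rng(0)
mu = np.zeros(13)
sigma = np.eye(13)
noisy = rng.normal(loc=0.5, scale=1.0, size=(200, 13))
b = online_kl_bias_adaptation(noisy, mu, sigma, lr=0.05)
print("estimated compensation bias (first 3 dims):", b[:3])
```

In the full formulation described above, the transform is a multivariate Gaussian network rather than a single bias, and the KL distance is taken against the recognizer's complete acoustic model, but the online gradient-step structure is analogous.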