Log-linear model combination with word-dependent scaling factors

Log-linear model combination is the standard approach in LVCSR for combining several knowledge sources, usually an acoustic model and a language model. Instead of using a single scaling factor per knowledge source, we make the scaling factor word- and pronunciation-dependent. In this work, we combine three acoustic models, a pronunciation model, and a language model for a Mandarin broadcast news and broadcast conversation (BN/BC) task. The achieved error rate reduction of 2% relative is small but consistent across two test sets. An analysis of the results shows that the major contribution comes from the improved interdependency of the language and acoustic models.

Index Terms: speech recognition, model combination, system combination, log-linear modeling, minimum risk training
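To make the idea of word-dependent scaling concrete, the following is a minimal sketch, not the authors' implementation. It assumes each hypothesis is a list of words with per-model log-probabilities, and that a scaling factor looked up by (model, word) overrides the usual single factor per model. The names `loglinear_score`, `posteriors`, `global_scale`, and `word_scale`, as well as the toy model names "am" and "lm", are hypothetical and chosen only for illustration; pronunciation-dependent factors and the training of the factors are left out.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical data layout: a hypothesis is a list of
# (word, {model_name: log_probability}) pairs.
WordScores = Tuple[str, Dict[str, float]]
ScaleKey = Tuple[str, str]  # (model_name, word)


def loglinear_score(
    hyp: List[WordScores],
    global_scale: Dict[str, float],
    word_scale: Optional[Dict[ScaleKey, float]] = None,
) -> float:
    """Sum of scaled log-probabilities over all words and models.

    A (model, word) entry in word_scale replaces the model's global factor;
    without such an entry the score reduces to the conventional combination
    with one scaling factor per knowledge source.
    """
    word_scale = word_scale or {}
    total = 0.0
    for word, log_probs in hyp:
        for model, log_p in log_probs.items():
            factor = word_scale.get((model, word), global_scale[model])
            total += factor * log_p
    return total


def posteriors(
    hyps: List[List[WordScores]],
    global_scale: Dict[str, float],
    word_scale: Optional[Dict[ScaleKey, float]] = None,
) -> List[float]:
    """Normalize the log-linear scores over a list of competing hypotheses."""
    scores = [loglinear_score(h, global_scale, word_scale) for h in hyps]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Toy example with made-up log-probabilities.
hyp_a = [("ni", {"am": -2.0, "lm": -1.0}), ("hao", {"am": -3.0, "lm": -0.5})]
hyp_b = [("ni", {"am": -2.0, "lm": -1.0}), ("hao3", {"am": -3.5, "lm": -0.4})]
scales = {"am": 1.0, "lm": 10.0}
per_word = {("lm", "hao"): 12.0}  # word-dependent override for one word

baseline = loglinear_score(hyp_a, scales)
adjusted = loglinear_score(hyp_a, scales, per_word)
post = posteriors([hyp_a, hyp_b], scales, per_word)
```

In this sketch the override changes only the contribution of the language model for the single word it is attached to, which shows how a word-dependent factor can shift the balance between language and acoustic scores locally rather than for the whole utterance.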
