Combining outputs of multiple LVCSR models by machine learning

This paper proposes applying machine learning techniques to the task of combining the outputs of multiple large-vocabulary continuous speech recognition (LVCSR) models. As features for learning, information such as which models output the hypothesized word, its part of speech, and its syllable length proves useful for improving the word recognition rate. Experimental results show that the combined result outperforms several baselines in word recognition rate, including voting-based model combination such as ROVER. Furthermore, unlike voting-based combination, the word recognition rate of combination by machine learning does not degrade even when only a minority of the participating models perform well on the word recognition task. © 2005 Wiley Periodicals, Inc. Syst Comp Jpn, 36(10): 9–15, 2005; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20340
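The contrast between voting and learned combination can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the model names `A`, `B`, `C`, the toy words, the lexicon values, and the training examples are hypothetical, the simple perceptron stands in for whatever learner the paper actually uses, and the majority vote is a much-simplified stand-in for ROVER, which also aligns hypotheses before voting. The feature set mirrors the abstract: which models output the word, its part of speech, and its syllable length.

```python
from collections import Counter

MODELS = ["A", "B", "C"]  # hypothetical LVCSR model names

def majority_vote(hyps):
    """Pick the word proposed by the most models (simplified voting baseline)."""
    return Counter(hyps.values()).most_common(1)[0][0]

def features(word, hyps, lexicon):
    """Indicator features: models that output `word`, its POS, its syllable length."""
    pos, syllables = lexicon[word]
    feats = {f"model={m}": 1.0 for m in MODELS if hyps.get(m) == word}
    feats[f"pos={pos}"] = 1.0
    feats[f"syllables={syllables}"] = 1.0
    feats["bias"] = 1.0
    return feats

def score(weights, feats):
    """Linear score of a feature dictionary under the given weights."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def train_perceptron(examples, epochs=20):
    """Train a perceptron on (features, label) pairs with labels +1 / -1."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            if label * score(weights, feats) <= 0:  # misclassified: update
                for k, v in feats.items():
                    weights[k] = weights.get(k, 0.0) + label * v
    return weights

def select_word(weights, hyps, lexicon):
    """Choose the candidate word with the highest learned score."""
    candidates = set(hyps.values())
    return max(candidates, key=lambda w: score(weights, features(w, hyps, lexicon)))

# Toy lexicon: word -> (part of speech, syllable length). Values are made up.
lexicon = {"alpha": ("noun", 2), "beta": ("noun", 2),
           "gamma": ("verb", 3), "delta": ("verb", 3)}

# Toy training data in which only model A is reliable (the minority case).
train = [
    (features("alpha", {"A": "alpha", "B": "beta", "C": "beta"}, lexicon), +1),
    (features("beta", {"A": "alpha", "B": "beta", "C": "beta"}, lexicon), -1),
    (features("gamma", {"A": "gamma", "B": "gamma", "C": "delta"}, lexicon), +1),
    (features("delta", {"A": "gamma", "B": "gamma", "C": "delta"}, lexicon), -1),
]
weights = train_perceptron(train)

# Test utterance position where the reliable model is outvoted.
test_hyps = {"A": "alpha", "B": "beta", "C": "beta"}
voted = majority_vote(test_hyps)
learned = select_word(weights, test_hyps, lexicon)
print("voting:", voted, "| learned:", learned)
```

In this toy setting the vote follows the two unreliable models, while the learned weights favor the word proposed by the one model that was correct in training, which is the situation the abstract describes where only a minority of models perform well.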
