Paper information - The 2011 KIT QUAERO speech-to-text system for Spanish

This paper describes our current Spanish speech-to-text (STT) system, which is being developed within the Quaero program and with which we participated in the 2011 Quaero STT evaluation. The system consists of four separate subsystems: in addition to the standard MFCC and MVDR phoneme-based subsystems, we included both a phoneme-based and a grapheme-based bottleneck subsystem. We carefully evaluate the performance of each subsystem. After including several new techniques, we were able to reduce the word error rate (WER) by over 30% relative, from 20.79% to 14.53%.
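The "over 30%" figure is a relative reduction, not an absolute one: the WER drops by about 6.3 percentage points, which is roughly 30% of the starting value. A minimal sketch of that arithmetic follows. The function name `relative_reduction` is an illustrative assumption and is not taken from the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# WER figures (in percent) as quoted in the abstract.
baseline, improved = 20.79, 14.53

absolute = baseline - improved                      # difference in percentage points
relative = relative_reduction(baseline, improved)   # fraction of the baseline WER

print(f"absolute: {absolute:.2f} points, relative: {relative:.1%}")
```

Reporting both numbers avoids ambiguity, since a "30% reduction" could otherwise be misread as an absolute drop in percentage points.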

Sebastian Stüker | Alexander H. Waibel | Kevin Kilgour | Christian Saam | Christian Mohr
