The 2013 KIT IWSLT speech-to-text systems for German and English

This paper describes our English Speech-to-Text (STT) systems for the 2013 IWSLT TED ASR track. Each system consists of multiple subsystems that combine different front-ends (e.g., MVDR-MFCC-based and lMel-based), GMM and NN acoustic models, and different phone sets. The subsystem outputs are merged via confusion network combination. Decoding proceeds in two stages: the second-stage systems are adapted in an unsupervised manner on the combined first-stage outputs using VTLN, MLLR, and cMLLR.
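
To make the combination step concrete, the following is a minimal sketch of confusion network combination, not the implementation used in the systems described here. It assumes the subsystem networks have already been aligned slot by slot, which a real combiner would have to do itself. The function names, the `<eps>` marker for an empty arc, and the uniform default weights are all hypothetical choices made for illustration.

```python
from collections import defaultdict

EPS = "<eps>"  # hypothetical marker for an empty (skip) arc in a slot


def combine_confusion_networks(networks, weights=None):
    """Merge slot-aligned confusion networks by weighted posterior averaging.

    Each network is a list of slots; each slot maps word -> posterior.
    Assumption: all networks have the same number of slots (pre-aligned).
    """
    if not networks:
        return []
    n_slots = len(networks[0])
    if any(len(net) != n_slots for net in networks):
        raise ValueError("sketch assumes networks with identical slot counts")
    if weights is None:
        # Assumed default: every subsystem contributes equally.
        weights = [1.0 / len(networks)] * len(networks)

    combined = []
    for i in range(n_slots):
        slot = defaultdict(float)
        for net, weight in zip(networks, weights):
            for word, posterior in net[i].items():
                slot[word] += weight * posterior
        combined.append(dict(slot))
    return combined


def best_path(network):
    """Pick the highest-scoring word in every slot and drop empty arcs."""
    words = [max(slot, key=slot.get) for slot in network]
    return [word for word in words if word != EPS]


if __name__ == "__main__":
    # Toy outputs of two hypothetical subsystems for the same utterance.
    sys_a = [{"the": 0.9, "a": 0.1}, {"talk": 0.6, "walk": 0.4}, {EPS: 0.7, "is": 0.3}]
    sys_b = [{"the": 0.8, "a": 0.2}, {"walk": 0.7, "talk": 0.3}, {"is": 0.8, EPS: 0.2}]
    print(best_path(combine_confusion_networks([sys_a, sys_b])))
```

In a two-stage setup like the one described above, a hypothesis of this kind from the first stage would serve as the transcript for unsupervised adaptation of the second-stage systems.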
