Developing STT and KWS systems using limited language resources

This paper presents recent progress in developing speech-totext (STT) and keyword spotting (KWS) systems for the 2014 IARPA-Babel evaluation. Systems have been developed for the limited language pack condition for four of the five development languages in this program phase: Assamese, Bengali, Haitian Creole and Zulu. The systems have several novel characteristics that support rapid development of KWS systems. On the STT side different acoustic units are explored based on phonemic or graphemic representations, and system combination is used to improve STT performance. The acoustic models are trained on only 10 hours of speech data with manual transcriptions, completed with unsupervised training on additional untranscribed data. Both word and subword units (morphologically decomposed, syllables, phonemes) are used for KWS. The KWS systems are based on the multi-hypotheses produced by a consensus network decoding or searching word lattices. The word error rates of the individual STT systems are on the order of 50-60%, and the KWS systems obtain Maximum Term Weighted Values ranging from 30-45% for all keywords (invocabulary and out-of-vocabulary (OOV)). Sub-word units are shown to be successful at locating some of the OOV keywords, and system combination improves system performance. Index Terms: STT, KWS, semi-supervised training, lattice, consensus network, sub-word lexical units, Morfessor

[1]  Jan Cernocký,et al.  But neural network features for spontaneous Vietnamese in BABEL , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Jean-Luc Gauvain,et al.  The LIMSI Broadcast News transcription system , 2002, Speech Commun..

[3]  Richard M. Schwartz,et al.  Normalizationofphonetic keyword search scores , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Spyridon Matsoukas,et al.  Developing a Speech Activity Detection System for the DARPA RATS Program , 2012, INTERSPEECH.

[5]  Andreas Stolcke,et al.  Finding consensus in speech recognition: word error minimization and other applications of confusion networks , 2000, Comput. Speech Lang..

[6]  Bhuvana Ramabhadran,et al.  Vocabulary independent spoken term detection , 2007, SIGIR.

[7]  Richard M. Schwartz,et al.  Discriminative semi-supervised training for keyword search in low resource languages , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[8]  Lin-Shan Lee,et al.  Improved Large Vocabulary Continuous Chinese Speech Recognition by Character-Based Consensus Networks , 2006, ISCSLP.

[9]  Jonathan G. Fiscus,et al.  Results of the 2006 Spoken Term Detection Evaluation , 2006 .

[10]  Mathias Creutz,et al.  Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0 , 2005 .

[11]  Martin Karafiát,et al.  Semi-supervised bootstrapping approach for neural network feature extractor training , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[12]  Laurent Besacier,et al.  Word/sub-word lattices decomposition and combination for speech recognition , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13]  Jean-Luc Gauvain,et al.  Comparing decoding strategies for subword-based keyword spotting in low-resourced languages , 2014, INTERSPEECH.

[14]  Richard M. Schwartz,et al.  Score normalization and system combination for improved keyword spotting , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[15]  Ebru Arisoy,et al.  Analysis of Morph-Based Speech Recognition and the Modeling of Out-of-Vocabulary Words Across Languages , 2007, HLT-NAACL.

[16]  Hermann Ney,et al.  Joint-sequence models for grapheme-to-phoneme conversion , 2008, Speech Commun..

[17]  Jonathan G. Fiscus,et al.  A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER) , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[18]  Jean-Luc Gauvain,et al.  Improved models for Mandarin speech-to-text transcription , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Ngoc Thang Vu,et al.  Speech recognition for machine translation in Quaero , 2011, IWSLT.