Hierarchical hybrid language models for open vocabulary continuous speech recognition using WFST

One of the main challenges in automatic speech recognition is recognizing an open, partly unseen vocabulary. Hybrid vocabularies that combine full words and sub-words are commonly used to reduce the out-of-vocabulary (OOV) rate implicitly. Even with sub-words, however, the OOV rate is not necessarily zero. In this work, we propose using separate character-level graphones (pairs of an orthographic sequence and a phoneme sequence) as sub-words, which effectively yields a zero OOV rate. To minimize negative effects on the core vocabulary of the most frequent words, we propose a hierarchical language modeling approach: the first-level hybrid language model is augmented with an OOV word class, which is replaced during search by character-level graphone sequences using a second-level, graphone-based character language model and acoustic model. This approach is realized on the fly using weighted finite-state transducers (WFSTs). On the Wall Street Journal corpus, the approach recognizes a significant fraction of OOVs, in comparison with the full-word and earlier hybrid language model approaches.
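
The two-level idea can be illustrated with a minimal Python sketch. It is not the WFST-based implementation described above: it only shows, on plain token lists, how an OOV word class at the first level can be expanded into character-level units at the second level, and why the residual OOV rate becomes zero when the character inventory covers the alphabet. The names `CORE_VOCAB`, `CHAR_GRAPHONES`, `first_level`, and `second_level` are hypothetical, and the character-to-phoneme mapping is a toy placeholder rather than a trained graphone model.

```python
from typing import Dict, List, Set

# Hypothetical toy data (assumptions, not taken from the paper).
CORE_VOCAB: Set[str] = {"the", "market", "rose"}
# Toy character-level "graphones": each letter is paired with a placeholder
# phoneme symbol (its uppercase form). A real system would learn these pairs.
CHAR_GRAPHONES: Dict[str, str] = {c: c.upper() for c in "abcdefghijklmnopqrstuvwxyz"}
OOV_CLASS = "<OOV>"


def oov_rate(words: List[str], vocab: Set[str]) -> float:
    """Fraction of running words that are not covered by the vocabulary."""
    if not words:
        return 0.0
    return sum(w not in vocab for w in words) / len(words)


def first_level(words: List[str]) -> List[str]:
    """First level: keep core-vocabulary words, map everything else to the OOV class."""
    return [w if w in CORE_VOCAB else OOV_CLASS for w in words]


def second_level(words: List[str]) -> List[str]:
    """Second level: replace each OOV word by its character-level graphone sequence."""
    out: List[str] = []
    for w in words:
        if w in CORE_VOCAB:
            out.append(w)
        else:
            # Each unit is written as "letter:phoneme"; a KeyError here would
            # mean the character inventory does not cover the input.
            out.extend(f"{c}:{CHAR_GRAPHONES[c]}" for c in w)
    return out


if __name__ == "__main__":
    sentence = ["the", "market", "zyx"]
    print(oov_rate(sentence, CORE_VOCAB))
    print(first_level(sentence))
    print(second_level(sentence))
```

In this sketch the core vocabulary is left untouched, and only words mapped to the OOV class are spelled out as characters, which mirrors the motivation for keeping the two levels separate.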
