Speech recognition without a lexicon - bridging the gap between graphemic and phonetic systems

Modern speech recognizers rely on three core components: an acoustic model, a language model, and a pronunciation lexicon. To extend speech recognition to low-resource languages and domains, techniques that reduce the expert knowledge needed to build these three components have been growing in popularity. In this paper, we present a method for automatically learning a weighted pronunciation lexicon in a data-driven fashion, without assuming the existence of any phonetic lexicon whatsoever. Given an initial grapheme acoustic model, our method uses a novel technique for semi-constrained acoustic unit decoding, whose output is used to help train a letter-to-sound (L2S) model. The L2S model is then used in conjunction with a Pronunciation Mixture Model (PMM) to infer a pronunciation lexicon. We evaluate our method on English as well as Lao and Haitian, two low-resource languages featured in the IARPA Babel program.

Index Terms: lexicon learning, pronunciation modeling
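
To make the final step concrete, the sketch below illustrates how a pronunciation mixture model can turn candidate pronunciations into a weighted lexicon entry. It is a simplified illustration, not the paper's implementation: it assumes isolated utterances of a single word, a fixed list of candidate pronunciations (which in the method above would come from the L2S model), and a table of acoustic likelihoods whose values here are invented. The function name `pmm_em` and all numbers are hypothetical.

```python
def pmm_em(likelihoods, iterations=20):
    """Estimate mixture weights over candidate pronunciations of one word.

    likelihoods[i][k] is an assumed acoustic likelihood p(u_i | b_k) of
    utterance i given candidate pronunciation k (hypothetical values).
    """
    n_prons = len(likelihoods[0])
    weights = [1.0 / n_prons] * n_prons  # start from uniform weights

    for _ in range(iterations):
        expected = [0.0] * n_prons
        # E-step: posterior of each pronunciation for each utterance.
        for row in likelihoods:
            joint = [w * l for w, l in zip(weights, row)]
            total = sum(joint)
            if total == 0.0:
                continue  # no candidate explains this utterance; skip it
            for k, j in enumerate(joint):
                expected[k] += j / total
        # M-step: renormalize the expected counts into new weights.
        norm = sum(expected)
        if norm == 0.0:
            break  # nothing to update; keep the current weights
        weights = [e / norm for e in expected]

    return weights


# Three utterances of one word, two candidate pronunciations (made-up numbers).
likelihoods = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
weights = pmm_em(likelihoods)
print(weights)
```

The resulting weights sum to one and can be read as the pronunciation probabilities of a weighted lexicon entry; candidates whose weight falls below a chosen threshold could then be pruned.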
