Automatic generation and selection of multiple pronunciations for dynamic vocabularies

We present an acoustic modeling scheme for speech recognition applications that require dynamic vocabularies. It targets out-of-vocabulary words that must be added to a recognition lexicon from only a few (typically one or two) spoken utterances of each word. Standard approaches to this problem derive a single pronunciation from each utterance by combining acoustic and phone transition scores. In our scheme, multiple pronunciations are generated from each enrollment utterance by varying the relative weights assigned to the acoustic and phone transition models. In our experiments, these multiple baseforms outperform the standard approach, with a relative reduction in word error rate of 20% to 40% on all our test sets.
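The weight-varying idea can be illustrated with a minimal sketch. The abstract does not state how the two scores are combined, so the sketch assumes a linear interpolation of frame-level acoustic log-scores and phone-to-phone transition log-probabilities inside a Viterbi search. All function names, the toy scores, and the weight grid are hypothetical and are not taken from the paper.

```python
def viterbi_phones(acoustic, transition, weight):
    """Best phone index per frame under an interpolated score (assumed form).

    acoustic[t][p]   : log-score of phone p at frame t (toy values)
    transition[q][p] : log-probability of moving from phone q to phone p
    weight           : relative weight of the transition model, in [0, 1];
                       the acoustic model receives 1 - weight
    """
    n_frames, n_phones = len(acoustic), len(acoustic[0])
    score = [(1.0 - weight) * acoustic[0][p] for p in range(n_phones)]
    back = []  # back[t-1][p] = best predecessor of phone p at frame t
    for t in range(1, n_frames):
        new_score, pointers = [], []
        for p in range(n_phones):
            best_q = max(range(n_phones),
                         key=lambda q: score[q] + weight * transition[q][p])
            new_score.append(score[best_q]
                             + weight * transition[best_q][p]
                             + (1.0 - weight) * acoustic[t][p])
            pointers.append(best_q)
        score = new_score
        back.append(pointers)
    # Trace back from the best final phone.
    state = max(range(n_phones), key=lambda p: score[p])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def collapse(path, phones):
    """Merge runs of identical frame labels into one phone symbol each."""
    out = []
    for i, state in enumerate(path):
        if i == 0 or state != path[i - 1]:
            out.append(phones[state])
    return tuple(out)


def generate_baseforms(acoustic, transition, phones, weights):
    """Decode once per weight and keep the distinct pronunciations, in order."""
    baseforms = []
    for w in weights:
        candidate = collapse(viterbi_phones(acoustic, transition, w), phones)
        if candidate not in baseforms:
            baseforms.append(candidate)
    return baseforms


# Toy enrollment utterance: 4 frames, 3 phones (values are made up).
PHONES = ["K", "AE", "T"]
ACOUSTIC = [[-1.0, -5.0, -6.0],
            [-2.0, -1.8, -6.0],
            [-5.0, -1.0, -4.0],
            [-6.0, -4.0, -1.0]]
TRANSITION = [[-0.5, -1.5, -2.5],
              [-2.5, -0.5, -1.5],
              [-2.0, -2.0, -0.7]]
baseforms = generate_baseforms(ACOUSTIC, TRANSITION, PHONES, [0.0, 0.3, 0.6, 0.9])
```

With a single weight, this reduces to the standard one-pronunciation-per-utterance setup. Sweeping the weight yields a small set of candidate baseforms per utterance, which a real system would still need to select among. A production decoder would also use HMM phone models with duration constraints rather than the frame-wise scores used here.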
