UNSUPERVISED PRONUNCIATION ADAPTATION FOR OFF-LINE TRANSCRIPTION OF JAPANESE LECTURE SPEECHES

Observing that most pronunciation variation is strongly dependent on the speaker and speaking style, and that introducing pronunciation variants into a speaker-independent recognition system yields only limited success, we refrain from applying multiple pronunciation variants in the speaker-independent case and instead introduce them without supervision when specializing the recognizer for a specific speaker. Our approach takes the decoder's output after a first recognition pass and realigns it, allowing several commonly observed pronunciation variations. In a second decoding pass, the pronunciation variations are integrated into the recognizer, weighted by maximum-likelihood estimates of the variants' probabilities computed on the realigned first-pass output. We observe a small but significant improvement in recognition accuracy over the first-pass output and conclude that the method helps adjust the pronunciation modeling to the speaker, speaking style, and speaking rate. A better prior choice of candidate pronunciation variations, drawing on deeper phonetic knowledge, would be beneficial for further improvements. We also show experimentally that the improvement gained through pronunciation adaptation does not overlap much with that gained by unsupervised adaptation of the acoustic models; rather, the achieved WER reductions are additive.
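The maximum-likelihood weighting step described above amounts to counting, in the realigned first-pass output, how often each pronunciation variant of a word was chosen, and normalizing per word. The sketch below illustrates this under stated assumptions: the function name, the data format, and the toy Japanese variants are illustrative inventions, not the paper's actual lexicon or alignment tooling.

```python
from collections import Counter, defaultdict

def estimate_variant_probs(aligned_pronunciations):
    """Maximum-likelihood estimates of pronunciation-variant weights.

    aligned_pronunciations: iterable of (word, variant) pairs, one per
    word token, taken from the realignment of the first-pass decoder
    output; `variant` is the pronunciation the aligner selected for
    that token. Returns {word: {variant: probability}}.
    """
    counts = defaultdict(Counter)
    for word, variant in aligned_pronunciations:
        counts[word][variant] += 1
    probs = {}
    for word, variant_counts in counts.items():
        total = sum(variant_counts.values())
        # Relative frequency is the ML estimate of the variant weight.
        probs[word] = {v: c / total for v, c in variant_counts.items()}
    return probs

# Toy realignment (hypothetical reduced/devoiced variants):
alignment = [
    ("sensei", "s e n s e i"),
    ("sensei", "s e n s e:"),   # assumed vowel-fusion variant
    ("sensei", "s e n s e:"),
    ("desu",   "d e s u"),
    ("desu",   "d e s"),        # assumed devoiced-vowel variant
]
probs = estimate_variant_probs(alignment)
```

In a real system these per-variant probabilities would be entered (typically as log weights) into the second-pass lexicon; this sketch shows only the estimation step.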
