Correction of phoneme recognition errors in word learning through speech interaction

This paper describes a novel method that enables users to teach a system the phoneme sequences of new words through speech interaction. With this method, users incrementally correct misrecognized phoneme sequences by making corrective utterances. Each corrective utterance may contain the whole word or only a segment of it. After each corrective utterance, if the corrected phoneme sequence is better than the previous one, the user can either end the interaction or make a further corrective utterance; otherwise, the user can reject the correction. The original contributions of this method are 1) interactive correction by speech, 2) the use of spoken word segments to locate misrecognized phonemes, and 3) the use of generalized posterior probability (GPP) as a measure for correcting misrecognized phonemes. Experimental results show that the proposed method achieved a phoneme accuracy of 96.8% and a word accuracy of 79.1% with fewer than seven corrective utterances.
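The correction step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `locate_segment` and `apply_correction` are hypothetical, the segment is located with a plain edit-distance alignment, and the per-phoneme confidence values are placeholders standing in for GPP scores, whose computation is not shown here. The automatic accept/reject comparison is likewise an assumption; in the described method the user decides whether to keep a correction.

```python
def locate_segment(hyp, seg):
    """Return (start, end, distance) for the span hyp[start:end] closest to seg.

    Uses edit distance where the span may start and end anywhere in hyp.
    """
    n, m = len(seg), len(hyp)
    # dist[i][j]: minimum edit distance between seg[:i] and a span of hyp ending at j
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: where that best span begins in hyp
    start = [list(range(m + 1)) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
        start[i][0] = 0
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (seg[i - 1] != hyp[j - 1])
            skip_seg = dist[i - 1][j] + 1  # segment phoneme with no counterpart
            skip_hyp = dist[i][j - 1] + 1  # hypothesis phoneme with no counterpart
            best = min(sub, skip_seg, skip_hyp)
            dist[i][j] = best
            if best == sub:
                start[i][j] = start[i - 1][j - 1]
            elif best == skip_seg:
                start[i][j] = start[i - 1][j]
            else:
                start[i][j] = start[i][j - 1]
    end = min(range(m + 1), key=lambda j: dist[n][j])
    return start[n][end], end, dist[n][end]


def apply_correction(hyp, hyp_conf, seg, seg_conf):
    """Replace the located span with seg if seg has higher mean confidence.

    hyp_conf and seg_conf are placeholder scores (assumed stand-ins for GPP).
    Returns the (possibly unchanged) phoneme list and confidence list.
    """
    start, end, _ = locate_segment(hyp, seg)
    old = hyp_conf[start:end]
    old_score = sum(old) / len(old) if old else 0.0
    new_score = sum(seg_conf) / len(seg_conf)
    if new_score <= old_score:
        return hyp, hyp_conf  # correction rejected, keep previous sequence
    new_hyp = hyp[:start] + seg + hyp[end:]
    new_conf = hyp_conf[:start] + seg_conf + hyp_conf[end:]
    return new_hyp, new_conf


# Illustrative data only: "tomato" misrecognized with "n" in place of "m".
hyp = ["t", "o", "n", "a", "t", "o"]
hyp_conf = [0.9, 0.8, 0.3, 0.6, 0.9, 0.9]
seg = ["m", "a", "t", "o"]
corrected, corrected_conf = apply_correction(hyp, hyp_conf, seg, [0.9] * 4)
print(corrected)
```

In this toy example the spoken segment aligns with the last four phonemes of the hypothesis, and the span is replaced because its placeholder confidence is higher. Repeating the call with further segments mirrors the incremental interaction the abstract describes.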
