Correcting phoneme recognition errors in learning word pronunciation through speech interaction

This paper presents Interactive Phoneme Update (IPU), a method that enables users to teach a system the pronunciations (phoneme sequences) of words in the course of speech interaction. With IPU, users correct mis-recognized phoneme sequences by repeatedly making correction utterances in response to the system's feedback. The method has two novel features: (1) word-segment-based correction, which lets users utter word segments to locate mis-recognized phonemes using open-begin-end dynamic programming matching and generalized posterior probability; and (2) history-based correction, which exploits the phoneme sequences recognized and corrected earlier during the interactive learning of each word. Experimental results show that the proposed IPU method reduces the error rate by a factor of three compared with a previously proposed maximum-likelihood-based method.
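
To give a rough sense of the word-segment-based correction step, the sketch below illustrates open-begin-end dynamic programming matching: an edit-distance DP in which the correction segment may start and end anywhere within the previously recognized phoneme sequence, so the best-matching span can be located. This is only a minimal illustration, not the paper's implementation; the phoneme symbols, cost values, and function name are assumptions for the example.

```python
# Minimal sketch (assumed, not the authors' implementation) of open-begin-end
# DP matching: locate where a short correction segment best aligns within a
# longer recognized phoneme sequence. Costs here are illustrative placeholders.

def locate_segment(recognized, segment, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Return (start, end, cost): the span recognized[start:end] that best
    matches `segment`, with free begin/end points in `recognized`."""
    n, m = len(recognized), len(segment)
    INF = float("inf")
    # dp[i][j]: best cost of aligning segment[:j] with a span ending at position i
    dp = [[0.0 if j == 0 else INF for j in range(m + 1)] for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]  # start index of each alignment
    for i in range(n + 1):
        back[i][0] = i  # free begin: the match may start at any recognized index

    for i in range(n + 1):
        for j in range(1, m + 1):
            # delete segment[j-1] (consume a segment symbol only)
            best, start = dp[i][j - 1] + del_cost, back[i][j - 1]
            if i > 0:
                # insert recognized[i-1] (consume a recognized symbol only)
                if dp[i - 1][j] + ins_cost < best:
                    best, start = dp[i - 1][j] + ins_cost, back[i - 1][j]
                # match or substitution
                cost = 0.0 if recognized[i - 1] == segment[j - 1] else sub_cost
                if dp[i - 1][j - 1] + cost < best:
                    best, start = dp[i - 1][j - 1] + cost, back[i - 1][j - 1]
            dp[i][j], back[i][j] = best, start

    # free end: the full segment may end at whichever position is cheapest
    end = min(range(n + 1), key=lambda i: dp[i][m])
    return back[end][m], end, dp[end][m]


# Hypothetical example: the user re-utters the segment "a k a" to pinpoint
# the mis-recognized portion of a longer recognized phoneme sequence.
recognized = "k o w a d a".split()
segment = "a k a".split()
print(locate_segment(recognized, segment))  # -> (3, 6, 1.0): best span and its cost
```

In the actual method, the span found this way would be scored with the generalized posterior probability before the corresponding phonemes are updated; the sketch only covers the alignment step.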
