Automatic baseform generation from acoustic data

We describe two algorithms for generating pronunciation networks from acoustic data. One is based on raw phonetic recognition; the other uses the spelling of each word and the identification of its language of origin as guides. In both cases, a pruning-and-voting procedure distills the noisy phonetic sequences into pronunciation networks. Recognition experiments on two large grammar-based test sets show reductions in sentence error rate between 2% and 14%, and in word error rate between 3% and 23%, when the learned baseforms are added to our baseline lexicons.
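The pruning-and-voting step can be illustrated with a minimal sketch (this is an assumption about its spirit, not the authors' actual algorithm): given several noisy phone-sequence hypotheses for a word, count each distinct sequence, prune those below a relative-frequency threshold, and keep the survivors as alternative baseforms. The example word, phone labels, and the `min_rel_freq` parameter are hypothetical.

```python
from collections import Counter

def distill_baseforms(hypotheses, min_rel_freq=0.2):
    """Prune-and-vote sketch: keep phone sequences whose relative
    frequency among the noisy hypotheses meets a threshold."""
    counts = Counter(tuple(h) for h in hypotheses)
    total = sum(counts.values())
    kept = {seq: n / total for seq, n in counts.items()
            if n / total >= min_rel_freq}
    # Return surviving pronunciations, most frequent first.
    return sorted(kept, key=kept.get, reverse=True)

# Hypothetical noisy phonetic decodings of the word "tomato".
hyps = [
    ["t", "ah", "m", "ey", "t", "ow"],
    ["t", "ah", "m", "ey", "t", "ow"],
    ["t", "ow", "m", "ey", "t", "ow"],
    ["t", "ah", "m", "aa", "t", "ow"],
    ["t", "ah", "m", "ey", "t", "ow"],
]
print(distill_baseforms(hyps))
```

A real system would align the hypotheses and vote per position to build a true network of alternative arcs, rather than keeping whole sequences.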
