Alternate phone models for conversational speech

This paper investigates the use of alternate phone models for the transcription of conversational telephone speech. The aim is to explore alternative ways of modeling different manners of speaking so as to better cover the observed articulatory styles and pronunciation variants. Four alternate phone sets, ranging from 38 to 129 units, are compared. Two of the phone sets make use of syllable-position-dependent phone models. The acoustic models were trained on 2300 hours of conversational telephone speech data from the Switchboard and Fisher corpora, and experimental results are reported on the EARS Dev04 test set, which contains 3 hours of speech from 36 Fisher conversations. No single phone set was found to outperform the others for a majority of speakers; the best overall performance was obtained with the original 48-phone set and a reduced 38-phone set. Combining the hypotheses of the individual models, however, reduces the word error rate from 17.5% (original phone set) to 16.8%.
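To put the reported gain in perspective, the short sketch below works out the absolute and relative word error rate reductions implied by the two figures quoted above. The variable names are illustrative only, and the relative figure is derived here rather than quoted from the paper.

```python
# Word error rates (in percent) as quoted in the abstract.
baseline_wer = 17.5   # original 48-phone set
combined_wer = 16.8   # after combining hypotheses of the individual models

# Absolute reduction in percentage points.
absolute_reduction = baseline_wer - combined_wer

# Relative reduction, expressed as a percentage of the baseline WER.
relative_reduction = 100.0 * absolute_reduction / baseline_wer

print(f"absolute: {absolute_reduction:.1f} points")
print(f"relative: {relative_reduction:.1f}%")
```

Under these figures, the combination corresponds to a gain of 0.7 points absolute, or about 4% relative to the original phone set.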
