Conversational telephone speech recognition

This paper describes the development of a speech recognition system for telephone conversations, starting from a state-of-the-art broadcast news (BN) transcription system. We identify the major changes and improvements in acoustic modeling, language modeling, and decoding that are required to achieve state-of-the-art performance on conversational speech. On the acoustic side, the major changes include speaker normalization via vocal tract length normalization (VTLN), handling of channel variability, efficient speaker adaptation, and better pronunciation modeling. On the linguistic side, the primary challenge is the limited amount of language model training data. To address this, we use a data selection technique and a smoothing technique based on a neural network language model. At the decoding level, lattice rescoring and minimum word error decoding are applied. On the development data, these improvements yield an overall word error rate of 24.9%, whereas the original BN transcription system had a word error rate of about 50% on the same data.
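
The word error rates quoted above are conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that definition, not the scoring tool used for the reported results; the function name `word_error_rate` and the example strings are hypothetical, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    wer = word_error_rate("so how was your weekend", "so how is your week end")
    print(f"WER: {wer:.1%}")
```

Because insertions count as errors, the value can exceed 100% when the hypothesis is much longer than the reference.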
