Large vocabulary automatic speech recognition might assist hearing-impaired telephone users by displaying a transcription of the incoming side of the conversation, but the system would have to achieve sufficient accuracy on conversational-style, telephone-bandwidth speech. We describe our development work toward such a system. This work comprised three phases: experiments with clean data filtered to 200-3500 Hz, experiments with real telephone data, and language model development. In the first phase, the speaker-independent error rate was reduced from 25% to 12% by using MLLT, increasing the number of cepstral components from 9 to 13, and increasing the number of Gaussians from 30,000 to 120,000. The resulting system, however, performed less well on actual telephony, producing an error rate of 28.4%. With additional adaptation and the use of an LDA and CDCN combination, the error rate was reduced to 19.1%; speaker adaptation reduced it further to 10.96%. These results were obtained with read speech. To explore the language-model requirements in a more realistic situation, we collected conversational speech in an arrangement where one participant could not hear the conversation and saw only recognizer output on a screen. We found that a mixture of language models, one derived from the Switchboard corpus and the other from prepared texts, resulted in approximately 10% fewer errors than either model alone.
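As a rough illustration of the final result, the sketch below shows linear interpolation of two language models, one conversational (Switchboard-style) and one built from prepared texts. This is not the paper's implementation: the toy unigram models, the training snippets, and the interpolation weight are illustrative assumptions only; in practice the weight would be tuned on held-out data.

```python
# Minimal sketch of a two-component language-model mixture (assumed, not the
# authors' system): P_mix(w) = lam * P_conv(w) + (1 - lam) * P_prep(w).
from collections import Counter


class UnigramLM:
    """Toy unigram model with add-one smoothing over a fixed vocabulary."""

    def __init__(self, tokens, vocab):
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab = vocab

    def prob(self, word):
        return (self.counts[word] + 1) / (self.total + len(self.vocab))


def mixture_prob(word, lm_conversational, lm_prepared, lam=0.5):
    """Linear interpolation of the two component models."""
    return lam * lm_conversational.prob(word) + (1 - lam) * lm_prepared.prob(word)


if __name__ == "__main__":
    # Illustrative stand-ins for Switchboard-style and prepared-text data.
    conv_tokens = "uh huh yeah i mean you know right".split()
    prep_tokens = "the committee will review the proposal today".split()
    vocab = set(conv_tokens) | set(prep_tokens)

    lm_conv = UnigramLM(conv_tokens, vocab)
    lm_prep = UnigramLM(prep_tokens, vocab)

    # lam=0.5 is an arbitrary placeholder; a real system would tune it.
    for w in ("yeah", "committee"):
        print(w, mixture_prob(w, lm_conv, lm_prep, lam=0.5))
```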