Natural number recognition using MCE trained inter-word context dependent acoustic models

Among applications that require number recognition, the focus has largely been on connected digit recognizers. We introduce an acoustic model topology for natural number recognition that applies minimum classification error (MCE) training to inter-word context-dependent models of the head-body-tail (HBT) type. Experiments on natural number applications involving dollar amounts and US telephone numbers show that HBT models reduce string error rates on natural number data by as much as 25% compared with context-independent whole-word models. In addition, for speech input that consists strictly of connected digits, using a natural number telephone grammar instead of a connected digit telephone grammar increases string error rates only negligibly. Because recognition accuracy is maintained while the user interface becomes more natural and flexible, this should help natural number speech recognition systems gain wider acceptance.
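To make the HBT topology concrete, the sketch below expands a spoken word string into head, body, and tail unit labels. It assumes the common HBT convention in which a word's head depends on the preceding word, its body is context independent, and its tail depends on the following word, with a hypothetical `sil` label standing in for silence at the string boundaries. The function name `hbt_units` and the label format are illustrative only and are not taken from the paper.

```python
from typing import List

# Hypothetical label for the silence context at utterance boundaries.
SIL = "sil"


def hbt_units(words: List[str]) -> List[str]:
    """Expand a word string into head-body-tail (HBT) unit labels.

    Assumed convention (illustrative, not the paper's exact labels):
      - head of word w depends on the preceding word (or silence),
      - body of word w is context independent,
      - tail of word w depends on the following word (or silence).
    """
    padded = [SIL] + list(words) + [SIL]
    units: List[str] = []
    for i in range(1, len(padded) - 1):
        prev_w, w, next_w = padded[i - 1], padded[i], padded[i + 1]
        units.append(f"{prev_w}-{w}.head")  # inter-word left context
        units.append(f"{w}.body")           # shared across all contexts
        units.append(f"{w}.tail+{next_w}")  # inter-word right context
    return units


if __name__ == "__main__":
    # A natural number dollar amount, e.g. "five hundred dollars".
    for unit in hbt_units(["five", "hundred", "dollars"]):
        print(unit)
```

For "five hundred dollars" this yields `sil-five.head`, `five.body`, `five.tail+hundred`, and so on, so the boundary between "five" and "hundred" is modeled by units that depend on both words, while each body is shared across contexts. Under this convention a whole-word model corresponds to keeping only one context-independent unit per word.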