Hybrid HMM-NN modeling of stationary-transitional units for continuous speech recognition

Abstract This paper describes the gains in recognition accuracy that can be achieved in a hybrid Hidden Markov Model – Neural Network (HMM–NN) recognition framework by using context-dependent subword units called Stationary–Transitional Units. These units consist of the stationary parts of the context-independent phonemes plus all the admissible transitions between them; they generalize well and capture a wide range of acoustic detail. They are well suited to modeling with neural networks, can improve the performance of hybrid HMM–NN systems, and represent a practical alternative to context-independent phonemes. Their effectiveness is verified for the Italian language on isolated and continuous speech recognition tasks drawn from a real application that provides telephone voice access to railway timetables. The results show a considerable improvement over context-independent phonemes.
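
As an illustration of how such a unit inventory might be assembled, the sketch below collects one stationary unit per phoneme and one transitional unit per adjacent phoneme pair. Several things here are assumptions made for the example and are not taken from the paper: the toy two-word lexicon, the rule that a transition is "admissible" when the pair occurs in the lexicon, the `a>b` naming of transitional units, and the choice to skip a transition between identical adjacent phonemes.

```python
def build_inventory(lexicon):
    """Collect stationary units and the transitions observed in a lexicon.

    `lexicon` maps each word to its phoneme sequence (hypothetical data).
    A transition counts as admissible here if the pair occurs in some word.
    """
    stationary = set()
    transitions = set()
    for phones in lexicon.values():
        stationary.update(phones)
        for left, right in zip(phones, phones[1:]):
            if left != right:  # assumption: no transition inside a repeated phoneme
                transitions.add((left, right))
    return sorted(stationary), sorted(transitions)


def to_units(phones):
    """Expand a phoneme sequence into alternating stationary and transitional units."""
    units = []
    for i, phone in enumerate(phones):
        units.append(phone)  # stationary part of the phoneme
        if i + 1 < len(phones) and phones[i + 1] != phone:
            units.append(f"{phone}>{phones[i + 1]}")  # transition to the next phoneme
    return units


# Toy lexicon with simplified transcriptions, for illustration only.
lexicon = {
    "roma": ["r", "o", "m", "a"],
    "mora": ["m", "o", "r", "a"],
}

stationary, transitions = build_inventory(lexicon)
roma_units = to_units(lexicon["roma"])
print(stationary)
print(transitions)
print(roma_units)
```

In this toy case four phonemes yield six transitional units, far fewer than the twelve possible ordered pairs, which hints at why restricting the inventory to admissible transitions keeps the number of network outputs manageable.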
