Continuous speech recognition based on high plausibility regions

The authors propose an approach to phoneme-based continuous speech recognition in which a time function of the plausibility of observing each phoneme (a spotting result) is given. They introduce a criterion for the best sentence based on the sum of the plausibilities of the individual symbols composing the sentence. By exploiting high plausibility regions to reduce the computational load while maintaining optimality, the method finds the most plausible sentences for the input speech. Two optimization procedures are defined to handle the embedded search processes: (1) finding the best path connecting peaks of the plausibility functions of two successive symbols, and (2) finding the best time transition slot index for two given peaks. Experimental results show that the method gives better recognition precision while requiring about 1/20 of the computing time of traditional DP-based methods. The experimental system obtained a 95% sentence recognition rate on a multispeaker test.
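
The sketch below is not the authors' implementation; it is a minimal illustration of the general idea of restricting a dynamic-programming search to high plausibility peaks rather than to every time frame. All names, the peak threshold, and the toy plausibility matrix are assumptions introduced for illustration only.

```python
# Hedged sketch: score a candidate phoneme sequence against per-phoneme
# plausibility functions, searching only over high-plausibility peaks.
# Everything here (find_peaks, score_sentence, the threshold values, the
# random toy data) is illustrative, not the method described in the paper.

import numpy as np


def find_peaks(plausibility, threshold=0.5):
    """For each phoneme, return time indices of local maxima whose
    plausibility exceeds `threshold` (the "high plausibility regions")."""
    peaks = {}
    for p, curve in enumerate(plausibility):
        peaks[p] = [t for t in range(1, len(curve) - 1)
                    if curve[t] >= threshold
                    and curve[t] >= curve[t - 1]
                    and curve[t] >= curve[t + 1]]
    return peaks


def score_sentence(phoneme_seq, plausibility, peaks):
    """Best cumulative plausibility of `phoneme_seq`, visiting one peak per
    phoneme with strictly increasing time: a simple DP over peaks only,
    instead of over every time frame."""
    # best[t] = best score of the prefix whose last phoneme sits at time t
    best = {t: plausibility[phoneme_seq[0]][t] for t in peaks[phoneme_seq[0]]}
    for p in phoneme_seq[1:]:
        new_best = {}
        for t in peaks[p]:
            earlier = [s for tp, s in best.items() if tp < t]
            if earlier:
                new_best[t] = max(earlier) + plausibility[p][t]
        best = new_best
    return max(best.values()) if best else float("-inf")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_phonemes, n_frames = 5, 40
    plaus = rng.random((n_phonemes, n_frames))     # toy spotting output
    pk = find_peaks(plaus, threshold=0.6)
    print(score_sentence([0, 3, 1], plaus, pk))    # hypothetical sentence
```

Because the DP states are only the peak positions, the search space shrinks from the full number of frames to the (typically much smaller) number of high plausibility peaks, which is the kind of saving the abstract attributes to the method relative to frame-by-frame DP.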
