Speech recognition complexity reduction using decimation of cepstral time trajectories

Speech recognition technology is now common in applications ranging from desktop computers with dictation engines to mobile devices with speaker-dependent name dialing. While dictation software runs only on powerful desktop PCs with large amounts of memory, mobile devices have limited memory and computational resources. To implement speech recognition algorithms on mobile devices, the complexity of the algorithms has to match the capabilities of the device. This paper addresses the problem of complexity and memory constraints in mobile devices. A specific approach, time domain decimation of feature vectors, is presented. This general signal processing technique can be applied to speech recognition because the modulation spectrum of the feature vector time trajectories is band-limited. By decimating the feature vector stream of 100 frames per second by factors of 2 to 5, the complexity of the speech recognizer can be reduced in proportion to the decimation factor. Experiments on a name dialing task show that a decimation factor of 4 can be used without any significant degradation in the performance of the speech recognizer. With the proposed method, the computational complexity can be reduced by 70% and RAM usage by more than 60%.
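
As a rough illustration of the idea, the following minimal sketch decimates a stream of feature vectors in time. It is not the paper's implementation: the function name `decimate_features`, the 13-dimensional frames, and the use of block averaging as a crude anti-aliasing low-pass filter are all assumptions made for this example. The abstract only states that the stream is decimated by an integer factor.

```python
from typing import List, Sequence

Frame = List[float]


def decimate_features(frames: Sequence[Sequence[float]], factor: int,
                      average: bool = True) -> List[Frame]:
    """Reduce the frame rate of a feature vector stream by `factor`.

    Hypothetical helper, not taken from the paper.
    average=True  : replace each block of `factor` frames by its mean
                    (assumed, very simple low-pass filter before downsampling).
    average=False : keep only the first frame of each block.
    Trailing frames that do not fill a complete block are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")

    decimated: List[Frame] = []
    # Step through the stream one complete block at a time.
    for start in range(0, len(frames) - factor + 1, factor):
        block = frames[start:start + factor]
        if average:
            dim = len(block[0])
            # Average each feature component over the block.
            decimated.append(
                [sum(frame[d] for frame in block) / factor for d in range(dim)]
            )
        else:
            decimated.append(list(block[0]))
    return decimated


# One second of synthetic features at 100 frames per second, 13 components each.
one_second = [[float(i)] * 13 for i in range(100)]
reduced = decimate_features(one_second, factor=4)
print(len(one_second), "->", len(reduced), "frames")
```

With a decimation factor of 4, the 100 frames per second input becomes 25 frames per second, so any per-frame work in the recognizer is carried out on a quarter as many frames.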
