A very low bit rate speech coder using HMM-based speech recognition/synthesis techniques

This paper presents a very low bit rate speech coder based on hidden Markov models (HMMs). The encoder performs phoneme recognition and transmits phoneme indexes, state durations, and pitch information to the decoder. In the decoder, phoneme HMMs are concatenated according to the phoneme indexes, and a sequence of mel-cepstral coefficient vectors is generated from the concatenated HMM using a maximum likelihood (ML) based speech parameter generation technique. Finally, synthetic speech is obtained by exciting the MLSA (mel log spectrum approximation) filter, whose coefficients are given by the mel-cepstral coefficients, according to the pitch information. A subjective listening test shows that the performance of the proposed coder at about 150 bit/s (for test data including 26% silence regions) is comparable to that of a VQ-based vocoder at 400 bit/s (= 8 bit/frame × 50 frame/s), with pitch left unquantized in both coders.
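The bit-rate comparison in the abstract can be illustrated with a short sketch. The frame-based figure (8 bit/frame at 50 frame/s) comes from the abstract. The segment-based helper, its name, and its example inputs (phoneme rate, inventory size, duration bits) are hypothetical placeholders chosen only to show the arithmetic; they are not the paper's actual bit allocation. As in the abstract, pitch is excluded from both rates.

```python
import math


def frame_based_bit_rate(bits_per_frame, frames_per_second):
    """Bit rate of a coder that sends a fixed-size code for every frame."""
    return bits_per_frame * frames_per_second


def segment_based_bit_rate(segments_per_second, inventory_size, duration_bits):
    """Bit rate of a coder that sends one index plus duration bits per segment.

    Hypothetical helper: assumes a fixed-length index over `inventory_size`
    units and a fixed number of duration bits per segment.
    """
    index_bits = math.ceil(math.log2(inventory_size))
    return segments_per_second * (index_bits + duration_bits)


# VQ-based vocoder from the abstract: 8 bit/frame x 50 frame/s.
vq_rate = frame_based_bit_rate(8, 50)
print(vq_rate)  # 400

# Illustrative values only (NOT taken from the paper):
# 10 phonemes/s, 40-phoneme inventory, 6 duration bits per phoneme.
example_rate = segment_based_bit_rate(10.0, 40, 6)
print(example_rate)  # 120.0
```

The sketch shows why a phoneme-level coder can reach a much lower rate than a frame-level one: the code is sent once per phoneme rather than once per frame, so the rate scales with the speaking rate instead of the frame rate.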
