A segmental speech coder based on a concatenative TTS

An extremely low bit rate speech coder based on a recognition/synthesis paradigm is proposed. In our speech coder, the speech signal is produced in a way similar to concatenative synthesis in text-to-speech (TTS) systems. Hence, database construction, unit selection, and prosody modification, which are the major parts of concatenative TTS, are employed to implement the speech coder. The synthesis units are automatically found in a large database using a joint segmentation/classification scheme. Dynamic programming (DP) is applied to unit selection, in which two cost functions, an acoustic target cost and a concatenation cost, are used to increase naturalness as well as intelligibility. Prosodic differences between the selected unit and the input segment are compensated for by time-scale and pitch modifications based on the harmonic plus noise model (HNM) framework. In single-speaker tests, the proposed scheme gave intelligible and natural-sounding speech at an average bit rate of about 580 b/s.
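
The DP unit selection step can be illustrated with a minimal sketch. The abstract does not specify the features or the exact form of the two costs, so everything concrete here is an assumption made for illustration: each input segment and each database unit is represented by a single feature tuple, both the target cost and the concatenation cost are plain Euclidean distances, and the names `dist`, `select_units`, and `w_concat` are hypothetical. The search itself is a standard Viterbi-style recursion that minimizes the sum of target costs plus weighted concatenation costs along the unit sequence.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length feature tuples (assumed cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_units(targets, candidates, w_concat=1.0):
    """Pick one candidate unit per input segment by dynamic programming.

    targets:    list of feature tuples, one per input segment
    candidates: list (same length) of lists of candidate unit features
    w_concat:   hypothetical weight on the concatenation cost
    Returns (path, total_cost), where path[t] indexes candidates[t].
    """
    # best[t][k]: lowest cumulative cost of any path ending in candidate k at t
    best = [[dist(targets[0], c) for c in candidates[0]]]
    back = []  # back[t-1][k]: best predecessor index for candidate k at t
    for t in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[t]:
            # cumulative cost of arriving at c from each previous candidate
            costs = [best[-1][j] + w_concat * dist(p, c)
                     for j, p in enumerate(candidates[t - 1])]
            j_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_best] + dist(targets[t], c))
            ptr.append(j_best)
        best.append(row)
        back.append(ptr)

    # backtrack from the cheapest final candidate
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return path, total


if __name__ == "__main__":
    targets = [(0.0,), (3.0,)]
    candidates = [[(0.0,), (2.0,)], [(3.0,)]]
    print(select_units(targets, candidates, w_concat=0.0))  # target cost only
    print(select_units(targets, candidates, w_concat=2.0))  # smooth joins favored
```

In this toy input, raising `w_concat` moves the first choice from the unit that best matches the input segment to the unit that joins more smoothly with its successor, which is the trade-off the two cost functions are meant to balance.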
