Integrating coding techniques into LP-based Mandarin text-to-speech synthesis

In this paper, speech coding techniques are integrated into a Mandarin text-to-speech (TTS) system. By exploiting the intrinsic properties of Mandarin, we encode the acoustic features of 408 syllabic utterances into templates, each containing modeling parameters for speech synthesis. As a result, the developed TTS system requires only 36 Kbytes to store all syllabic templates. In the synthesis stage, modeling parameters retrieved from the templates are modified according to the prosody estimated from a hierarchically layered model. To gauge the overall performance of this TTS system, we conduct listening tests, which yield 86.4% intelligibility and 97% comprehensibility. A simplified Mandarin TTS system is also implemented on an FPGA development board. The FPGA realization leads us to believe that such a TTS synthesizer can be easily incorporated into other portable devices as a voicing interface.
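To put the storage figure in perspective, the short calculation below divides the stated 36 Kbytes evenly across the 408 syllabic templates, giving roughly 90 bytes per template. It is only an illustrative sketch: the abstract does not say whether a Kbyte means 1024 or 1000 bytes, nor whether all templates are the same size, so both the 1024-byte default and the even split are assumptions, and the function name is hypothetical.

```python
# Average storage per syllabic template, using the figures in the abstract.
NUM_SYLLABLES = 408       # syllabic templates (from the abstract)
TOTAL_KBYTES = 36         # total template storage (from the abstract)
BYTES_PER_KBYTE = 1024    # assumption: 1 Kbyte = 1024 bytes


def bytes_per_template(total_kbytes=TOTAL_KBYTES,
                       num_templates=NUM_SYLLABLES,
                       bytes_per_kbyte=BYTES_PER_KBYTE):
    """Return the mean template size in bytes, assuming an even split."""
    return total_kbytes * bytes_per_kbyte / num_templates


if __name__ == "__main__":
    print(f"{bytes_per_template():.1f} bytes per template (1 Kbyte = 1024 bytes)")
    print(f"{bytes_per_template(bytes_per_kbyte=1000):.1f} bytes per template (1 Kbyte = 1000 bytes)")
```

Under either convention the budget stays below 100 bytes per syllable, which indicates how compact the coded parameter sets must be.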
