A TD-PSOLA based method for speech synthesis and compression

Mobility and cost restrictions of current text-to-speech systems keep them from being used by people with speech impairments around the world. New ways to improve mobility and lower cost therefore have to be developed, and one route is to reduce the computational resources that speech synthesis systems require. Non-parametric concatenative synthesis techniques provide the easiest way to generate high-quality artificial speech. Although they can, in general, be computationally efficient (e.g., TD-PSOLA), they are not always suited to implementation on embedded devices because they require rather large databases of recorded speech. A large part of the recorded speech data consists of vowel samples. Compression ratios of at least 25% can therefore be achieved for Romanian by removing all of these samples except one overlap-add (OLA) frame. At synthesis time, the remaining frame of the vowel is used to regenerate the original sound. This paper presents a new method for the compression and generation of vowels that starts from only one OLA frame and uses TD-PSOLA in a new way. Experiments show that, with appropriately chosen pitch and amplitude jitter models, high-quality synthetic speech can be achieved.
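The idea of regenerating a vowel from a single stored OLA frame can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `synthesize_vowel`, the Gaussian jitter model, and the default jitter magnitudes are assumptions chosen for illustration only. The sketch overlap-adds copies of one windowed frame at pitch marks whose spacing and gain are slightly perturbed.

```python
import numpy as np

def synthesize_vowel(frame, pitch_period, n_periods,
                     pitch_jitter=0.01, amp_jitter=0.03, seed=0):
    """Overlap-add copies of one windowed frame at jittered pitch marks.

    frame        : windowed OLA frame (typically about two pitch periods long)
    pitch_period : nominal pitch period in samples
    n_periods    : number of frame copies to place
    pitch_jitter : relative std. dev. of the period (assumed Gaussian model)
    amp_jitter   : relative std. dev. of the gain (assumed Gaussian model)
    """
    rng = np.random.default_rng(seed)
    frame = np.asarray(frame, dtype=float)

    # Jittered spacing between consecutive pitch marks, at least one sample.
    periods = pitch_period * (1.0 + pitch_jitter * rng.standard_normal(n_periods))
    periods = np.maximum(1, np.rint(periods).astype(int))

    # Start position of each frame copy; the first copy starts at sample 0.
    starts = np.concatenate(([0], np.cumsum(periods[:-1])))

    # Jittered gain applied to each copy.
    gains = 1.0 + amp_jitter * rng.standard_normal(n_periods)

    out = np.zeros(starts[-1] + len(frame))
    for start, gain in zip(starts, gains):
        out[start:start + len(frame)] += gain * frame
    return out, starts

# Toy usage: a Hann-windowed, two-period segment of a synthetic periodic signal
# (100 Hz fundamental at an assumed 8 kHz sampling rate).
fs, f0 = 8000, 100
period = fs // f0
t = np.arange(2 * period) / fs
segment = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in (1, 2, 3))
vowel, marks = synthesize_vowel(np.hanning(2 * period) * segment, period, 50)
```

Setting both jitter parameters to zero gives a strictly periodic signal, which tends to sound buzzy; the small random perturbations stand in for the pitch and amplitude jitter models whose choice the abstract reports as important for quality.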
