Reducing the footprint of the IBM trainable speech synthesis system

This paper presents a novel approach to concatenative speech synthesis that reduces the dataset size of a concatenative text-to-speech system, namely the IBM trainable speech synthesis system, by more than an order of magnitude. A spectral acoustic-feature-based speech representation is used both for computing the cost function during segment selection and for speech generation. Initial results indicate that even with a dataset of only a few megabytes, the achievable quality is significantly higher than that of existing small-footprint formant-based synthesizers.
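The abstract does not detail the cost function, so as a rough illustration only, the Python sketch below assumes mel-cepstral frames as the spectral features and uses a greedy search in place of the dynamic-programming segment search typically used in such systems; all function and variable names are hypothetical.

```python
# Minimal sketch (not the paper's implementation): using a spectral
# acoustic-feature representation to compute a concatenation cost
# that drives segment selection.
import numpy as np


def spectral_distance(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Euclidean distance between two spectral feature vectors (e.g., mel-cepstral frames)."""
    return float(np.linalg.norm(frame_a - frame_b))


def concatenation_cost(prev_segment: np.ndarray, next_segment: np.ndarray) -> float:
    """Cost of joining two segments: spectral mismatch between their boundary frames."""
    return spectral_distance(prev_segment[-1], next_segment[0])


def select_segments(candidates: list[list[np.ndarray]]) -> list[int]:
    """Greedy segment selection: for each slot, pick the candidate whose boundary
    frame is spectrally closest to the previously chosen segment.
    (A real system would search over all slots jointly, e.g., with Viterbi.)"""
    chosen = [0]  # arbitrarily take the first candidate for the first slot
    for slot in range(1, len(candidates)):
        prev = candidates[slot - 1][chosen[-1]]
        costs = [concatenation_cost(prev, cand) for cand in candidates[slot]]
        chosen.append(int(np.argmin(costs)))
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three target slots, each with four candidate segments of 10 frames x 13 coefficients.
    candidates = [[rng.normal(size=(10, 13)) for _ in range(4)] for _ in range(3)]
    print(select_segments(candidates))
```

Because the same spectral features can also be used to regenerate the waveform (e.g., reconstruction from mel-frequency cepstral coefficients and pitch), the system can store feature vectors rather than raw waveform segments, which is where the footprint reduction claimed in the abstract comes from.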
