Current status of the IBM Trainable Speech Synthesis System

This paper describes the current status of the IBM Trainable Speech Synthesis System. The system is a state-of-the-art, trainable, unit-selection based concatenative speech synthesiser. It uses hidden Markov models (HMMs) to provide a phonetic transcription and HMM state alignment of a database of single-speaker continuous-speech training data. The runtime synthesiser uses the resulting HMM-state-sized segments as its basic synthesis units. To synthesise a target sentence, it selects the segments to concatenate using decision trees built from the training data and a dynamic programming search that optimises a perceptually motivated cost function. The synthesiser can operate both in general-domain Text-to-Speech mode and in Phrase Splicing mode, which provides higher quality synthesis in limited domains. Systems have been built for at least 10 different languages, comprising over 70 voices.
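
To make the selection step concrete, the following is a minimal, illustrative sketch of a dynamic programming (Viterbi-style) unit-selection search of the general kind described above: each target position (e.g. an HMM state) has a decision-tree-supplied list of candidate segments, and the search minimises the sum of target and concatenation costs. All names, signatures and cost functions here are hypothetical and are not drawn from the IBM implementation.

    # Generic unit-selection search sketch (hypothetical names, not the IBM system).
    from typing import Callable, List, Sequence

    def select_units(
        candidates: Sequence[Sequence[int]],       # candidate segment ids per target position
        target_cost: Callable[[int, int], float],  # cost of a segment at a target position
        concat_cost: Callable[[int, int], float],  # cost of joining two adjacent segments
    ) -> List[int]:
        """Return the lowest-total-cost segment sequence via dynamic programming."""
        # best[i][j]: minimum cost of any path ending in candidates[i][j]
        best = [[target_cost(0, seg) for seg in candidates[0]]]
        back: List[List[int]] = [[-1] * len(candidates[0])]

        for i in range(1, len(candidates)):
            row, ptrs = [], []
            for seg in candidates[i]:
                costs = [best[i - 1][k] + concat_cost(prev, seg)
                         for k, prev in enumerate(candidates[i - 1])]
                k_min = min(range(len(costs)), key=costs.__getitem__)
                row.append(costs[k_min] + target_cost(i, seg))
                ptrs.append(k_min)
            best.append(row)
            back.append(ptrs)

        # Trace back the optimal path from the cheapest final candidate.
        j = min(range(len(best[-1])), key=best[-1].__getitem__)
        path = [j]
        for i in range(len(candidates) - 1, 0, -1):
            j = back[i][j]
            path.append(j)
        path.reverse()
        return [candidates[i][j] for i, j in enumerate(path)]

In practice the candidate lists, target costs and concatenation costs would be derived from the decision trees and the perceptually motivated cost function mentioned above; the sketch only shows the search structure.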
