A new distance measure for costing spectral discontinuities in concatenative speech synthesizers

In many modern concatenative speech synthesisers, the unit sequence used to synthesise each sentence is determined at runtime by a search algorithm that seeks to optimise a multidimensional cost function. One component of this function is usually some form of spectral continuity cost, computed between the end of one segment and the start of the following segment, and intended to ensure that the synthetic speech contains no unpleasant spectral discontinuities. This paper presents the results of listening tests conducted to evaluate the performance of several candidate continuity measures. It also describes a new continuity measure, developed at IBM, which substantially outperforms all the other measures tested.
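To make the role of the continuity cost in the search concrete, the following is a minimal sketch, not the measure proposed in this paper. It assumes each candidate unit is represented only by one spectral feature vector at its start and one at its end, uses plain Euclidean distance as a stand-in continuity measure, and combines it with a hypothetical per-unit target cost in a Viterbi-style dynamic programming search. The names `continuity_cost`, `select_units`, `target_costs` and `weight` are illustrative assumptions.

```python
import math


def continuity_cost(end_frame, start_frame):
    """Euclidean distance between two spectral feature vectors.

    A stand-in for illustration only; it is not the IBM measure.
    """
    if len(end_frame) != len(start_frame):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(end_frame, start_frame)))


def select_units(candidates, target_costs, weight=1.0):
    """Viterbi-style search for the lowest-cost unit sequence.

    candidates[t] is a list of (start_frame, end_frame) pairs for position t.
    target_costs[t][i] is a hypothetical target cost for candidate i at t.
    Returns (total_cost, chosen_indices).
    """
    best = list(target_costs[0])  # best cost of any path ending at each candidate
    back = []                     # back-pointers for positions 1..T-1
    for t in range(1, len(candidates)):
        new_best, pointers = [], []
        for j, (start_j, _) in enumerate(candidates[t]):
            # Cost of reaching candidate j from each predecessor i.
            costs = [
                best[i] + weight * continuity_cost(end_i, start_j)
                for i, (_, end_i) in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + target_costs[t][j])
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, path


if __name__ == "__main__":
    # Two positions, two candidates each, one-dimensional features.
    candidates = [
        [([0.0], [0.0]), ([0.0], [5.0])],
        [([5.0], [5.0]), ([1.0], [1.0])],
    ]
    target_costs = [[0.0, 0.0], [0.0, 0.0]]
    print(select_units(candidates, target_costs))
```

In this toy example the search prefers the pair of units whose boundary vectors match exactly. Any other continuity measure, including one tuned against listening tests, could be substituted for `continuity_cost` without changing the search itself.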
