Discontinuity detection in concatenated speech synthesis based on nonlinear speech analysis

An objective distance measure that can predict audible discontinuities is important for concatenated speech synthesis systems. Previous work was primarily based on features estimated by linear and/or stationary models of speech. In this paper, we introduce two nonlinear approaches for detecting discontinuities. The first method is based on a nonlinear harmonic model of speech, while the second is based on the demodulation of speech into amplitude and frequency components using the Teager energy operator. Fisher’s linear discriminant was used to separate signals with audible discontinuities from those perceived as continuous. When we combined the two methods using Fisher’s linear discriminant, a detection rate of 56.5% was achieved, which is a 90% improvement over previously published results on the same database.
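To make the two building blocks named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard discrete Teager-Kaiser energy operator, Psi[x(n)] = x(n)^2 - x(n-1) x(n+1), and the textbook two-class Fisher direction w = S_w^{-1} (m_a - m_b). The function names `teager_energy` and `fisher_direction` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np


def teager_energy(x):
    """Discrete Teager-Kaiser energy: x(n)^2 - x(n-1) * x(n+1).

    Returns an array two samples shorter than the input, because the
    operator needs one neighbour on each side.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]


def fisher_direction(a, b):
    """Two-class Fisher linear discriminant direction.

    a, b: arrays of shape (n_samples, n_features), one per class.
    Returns w = S_w^{-1} (mean_a - mean_b), where S_w is the pooled
    within-class scatter matrix (assumed non-singular here).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    scatter = (a - mean_a).T @ (a - mean_a) + (b - mean_b).T @ (b - mean_b)
    return np.linalg.solve(scatter, mean_a - mean_b)


# Toy usage (hypothetical data, not from the paper's database).
n = np.arange(200)
tone = 2.0 * np.cos(0.3 * n)      # amplitude 2, frequency 0.3 rad/sample
energy = teager_energy(tone)      # constant A^2 * sin^2(omega) for a pure tone

discontinuous = np.array([[2.0, 0.0], [3.0, 1.0], [2.0, 1.0], [3.0, 0.0]])
continuous = np.array([[-2.0, 0.0], [-3.0, 1.0], [-2.0, 1.0], [-3.0, 0.0]])
w = fisher_direction(discontinuous, continuous)
scores_disc = discontinuous @ w   # projections onto the discriminant
scores_cont = continuous @ w
```

For a pure sinusoid the operator output is constant, which is why it is commonly used to track amplitude and frequency modulations. Thresholding the projected scores is one simple way to turn the Fisher direction into a detector. The features and thresholds used in the paper itself are not specified in this abstract.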
