Control of spectral dynamics in concatenative speech synthesis

Current speech synthesis methods based on the concatenation of waveform units can produce highly intelligible speech capturing the identity of a particular speaker. However, the quality of concatenated speech often suffers from discontinuities between the acoustic units, due to contextual differences and variations in speaking style across the database. In this paper, we present methods to spectrally modify speech units in a concatenative synthesizer to correspond more closely to the acoustic transitions observed in natural speech. First, a technique called "unit fusion" is proposed to reduce spectral mismatch between units. In addition to concatenation units, a second, independent tier of units is selected that defines the desired spectral dynamics at concatenation points. Both unit tiers are "fused" to obtain natural transitions throughout the synthesized utterance. The unit fusion method is further extended to control the perceived degree of articulation of concatenated units. A signal processing technique based on sinusoidal modeling is also presented that enables high-quality resynthesis of units with a modified spectral shape.
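The abstract does not say how the two unit tiers are combined, so the following is only a minimal sketch of one way such a "fusion" could be posed. It is not the authors' algorithm. It assumes each spectral parameter (for example, one formant-like track) is a per-frame sequence, and it solves a weighted least-squares problem that keeps the fused track close to the concatenated units' values while pulling its frame-to-frame differences toward those of the second, dynamics-defining tier. The function name `fuse_trajectory`, the weighting scheme, and the toy data are all hypothetical.

```python
import numpy as np

def fuse_trajectory(concat_track, dynamics_track, delta_weight):
    """Least-squares blend of one spectral-parameter track (illustrative only).

    concat_track   : per-frame values from the concatenated units
                     (may jump at the join)
    dynamics_track : per-frame values from the second tier; only its
                     frame-to-frame differences are used
    delta_weight   : n-1 non-negative weights, one per frame-to-frame
                     difference; larger values near the join enforce
                     the second tier's dynamics more strongly there
    """
    concat_track = np.asarray(concat_track, dtype=float)
    dynamics_track = np.asarray(dynamics_track, dtype=float)
    n = concat_track.size
    w = np.sqrt(np.asarray(delta_weight, dtype=float))

    # First-difference operator: (D @ x)[t] = x[t + 1] - x[t]
    D = np.diff(np.eye(n), axis=0)

    # Stack two sets of equations and solve them jointly:
    #   x          ~= concat_track             (stay near the units)
    #   w * (D x)  ~= w * (D dynamics_track)   (follow the target dynamics)
    A = np.vstack([np.eye(n), w[:, None] * D])
    b = np.concatenate([concat_track, w * (D @ dynamics_track)])
    fused, *_ = np.linalg.lstsq(A, b, rcond=None)
    return fused

# Toy data (assumed): two units whose parameter jumps from 1.0 to 3.0 at the join.
concat = np.array([1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0])
# Assumed second-tier track with a smooth transition across the same frames.
dynamics = np.linspace(1.0, 3.0, 8)
# Enforce the dynamics only around the join (differences 2, 3 and 4).
weights = np.array([0.0, 0.0, 5.0, 10.0, 5.0, 0.0, 0.0])

fused = fuse_trajectory(concat, dynamics, weights)
```

With zero weights the result reduces to the concatenated track unchanged; with very large weights everywhere it follows the second tier's frame-to-frame differences almost exactly. Intermediate weights localized around the join, as in the toy call, spread the jump over several frames while leaving frames far from the join untouched.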
