Concatenative synthesis based on a harmonic model

One of the most successful approaches to synthesizing speech, concatenative synthesis, combines recorded speech units to build full utterances. However, the prosody of the stored units is often inconsistent with that of the target utterance and must be altered. Furthermore, several types of mismatch can occur at unit boundaries and must be smoothed. Thus, pitch and time-scale modification techniques, as well as smoothing algorithms, play a crucial role in such concatenation-based systems. In this paper, we describe novel approaches to each of these issues. First, we present a conceptually simple technique for pitch and time-scale modification of speech. Our method is based on a harmonic coding of each speech frame and operates entirely within the original sinusoidal model. Crucially, it makes no use of "pitch pulse onset times." Instead, phase coherence, and thus shape invariance, is ensured by exploiting the harmonic relation between the sine waves used to code each analysis frame, so that their phases at each synthesis frame boundary are consistent with those derived during analysis. Second, we describe a smoothing algorithm aimed specifically at correcting phase mismatches at unit boundaries. Results are presented showing that our prosodic modification techniques are highly suitable for use within a concatenative speech synthesizer.
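The harmonic relation mentioned above can be illustrated with a small sketch. This is not the paper's algorithm, only a toy example of the underlying idea: when every sine wave in a frame is a harmonic of one fundamental, advancing the phase of harmonic k by k times the fundamental's phase advance keeps the waveform continuous across a synthesis frame boundary, even after the pitch is scaled. The function names, the amplitude and phase values, the 8 kHz sampling rate, and the scaling factor `beta` are all assumptions made for illustration. The amplitudes are also held fixed for simplicity, whereas a practical system would typically re-estimate them at the new harmonic frequencies.

```python
import math

def synth_harmonic_frame(amps, phases, w0, length):
    """Sum of harmonics: s[n] = sum_k A_k * cos(k*w0*n + phi_k), k = 1..K."""
    return [
        sum(a * math.cos(k * w0 * n + p)
            for k, (a, p) in enumerate(zip(amps, phases), start=1))
        for n in range(length)
    ]

def advance_phases(phases, w0, hop):
    """Phase of each harmonic after `hop` samples at fundamental w0 (rad/sample).

    Harmonic k advances by k*w0*hop, so the harmonics stay locked together.
    """
    return [(p + k * w0 * hop) % (2 * math.pi)
            for k, p in enumerate(phases, start=1)]

# Hypothetical analysis frame: three harmonics of a 100 Hz fundamental at 8 kHz.
amps = [1.0, 0.5, 0.25]
phases = [0.3, 1.1, -0.7]
w0 = 2 * math.pi * 100 / 8000

beta = 1.5          # assumed pitch-scaling factor
hop = 80            # assumed synthesis frame length in samples
w0_mod = beta * w0  # modified fundamental

# Synthesize one frame (plus the sample at the boundary), then start the next
# frame from the phases that the harmonic relation predicts at that boundary.
frame_a = synth_harmonic_frame(amps, phases, w0_mod, hop + 1)
next_phases = advance_phases(phases, w0_mod, hop)
frame_b = synth_harmonic_frame(amps, next_phases, w0_mod, hop)

# The first sample of the next frame should match the boundary sample.
boundary_error = abs(frame_a[hop] - frame_b[0])
print(f"boundary mismatch: {boundary_error:.2e}")
```

Because the phase advance is computed from the fundamental alone, this sketch needs no estimate of pitch pulse onset times. It demonstrates boundary continuity only, not the full shape-invariant modification described in the abstract.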
