Intonation modeling for TTS using a joint extraction and prediction approach

This paper presents a joint extraction and prediction framework for intonation modeling. The intonation model follows a superpositional approach based on Bézier curves, with components attached to minor phrases and accent groups. A greedy algorithm performs successive partitions of the training data using linguistic information, and the parameters associated with each partition are obtained through a global optimization procedure. In this way, the extraction process is closely tied to the prediction step, which is hypothesized to improve the final performance. Several experiments test this hypothesis, using a two-step intonation modeling procedure for comparison. The results show that the prediction accuracy of the joint approach is higher than that of the reference method. The approach also avoids some parameter extraction steps that can introduce additional noise, such as the interpolation step used in some intonation models.
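The superpositional idea can be illustrated with a minimal sketch: an F0 contour is formed by adding one Bézier component spanning a minor phrase to Bézier components spanning each accent group. This is an illustration only, not the paper's implementation. The function names (`bezier`, `superpose`), the frame-index representation, the polynomial orders, and the example values in Hz are all assumptions; the abstract does not specify them.

```python
from math import comb


def bezier(coeffs, t):
    """Evaluate a 1-D Bezier polynomial at t in [0, 1].

    `coeffs` are the Bernstein coefficients (control values); the
    polynomial order is len(coeffs) - 1.
    """
    n = len(coeffs) - 1
    return sum(c * comb(n, i) * t**i * (1 - t) ** (n - i)
               for i, c in enumerate(coeffs))


def superpose(n_frames, phrase, accent_groups):
    """Sum a phrase component and accent-group components frame by frame.

    Each component is a hypothetical (start, end, coeffs) triple, with
    `start` inclusive and `end` exclusive, in frame indices. Time is
    normalized to [0, 1] inside each unit before evaluating its curve.
    """
    contour = [0.0] * n_frames
    for start, end, coeffs in [phrase] + list(accent_groups):
        span = max(end - start - 1, 1)  # avoid division by zero for 1-frame units
        for k in range(start, end):
            contour[k] += bezier(coeffs, (k - start) / span)
    return contour


# Assumed toy values: a linear phrase declination from 120 Hz to 100 Hz
# over 5 frames, plus one quadratic accent bump over frames 1..3.
phrase = (0, 5, [120.0, 100.0])
accents = [(1, 4, [0.0, 20.0, 0.0])]
f0 = superpose(5, phrase, accents)
print(f0)
```

In a joint extraction and prediction setting, the coefficients of such components would be estimated for each partition by a global optimization over the training contours rather than set by hand as they are here.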
