Experiments with signal-driven symbolic prosody for statistical parametric speech synthesis

This paper presents a preliminary study on the use of symbolic prosody extracted from the speech signal to improve parameter prediction in HMM-based speech synthesis. The relationship between the prosodic labelling and the actual prosody of the training data is usually ignored when building corpus-based TTS voices. In this work, several systems were trained using prosodic labels predicted from speech and compared with a conventional system that predicts those labels solely from text. Experiments were carried out using data from two speakers (one male and one female). Objective evaluation on a test set from the corpora shows that the proposed systems improve the prediction accuracy of phoneme durations and F0 trajectories. The advantages of using signal-driven symbolic prosody in place of conventional text-driven symbolic prosody are described, along with future work on the effective use of this information in the synthesis stage of a text-to-speech system.
