A new model of phone duration and energy is presented. These parameters are modelled in two stages. The first stage builds a statistics tree that contains the mean and standard deviation of phone duration and energy at each node. The branches of the tree are characterised by a set of factors related to phonetic context. The second stage treats phone duration and energy as being modified by two syllable-level prosodic coefficients, one for duration and one for energy. These coefficients influence the phones of a syllable to differing degrees, so weights are associated with the different phone positions in a syllable. A simulated annealing technique is used to find the set of weights that allows the prosodic coefficients to be calculated for all syllables and, in turn, minimises the error in predicting phone duration and energy during synthesis. Phone duration and energy are predicted with a mean squared error of 15.4 ms and 6.8 dB, respectively. During synthesis, the syllable-level prosodic coefficients are predicted by regression trees from linguistic information. Manual prosodic labelling is not required at any stage.
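As a rough illustration of the two-stage idea, the sketch below combines per-node statistics with a syllable-level coefficient scaled by a position weight. The abstract does not give the exact functional form, so the linear rule `mean + weight * coefficient * std`, the single shared weight per phone position, and all names (`NodeStats`, `predict_phone`, `fit_coefficient`) are assumptions made for this example, not the paper's formulation. The closed-form least-squares fit stands in for the step of calculating a coefficient for a syllable once the weights are fixed; the simulated annealing search over the weights themselves is not shown.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Statistics stored at one node of the (hypothetical) statistics tree."""
    dur_mean: float  # mean phone duration, ms
    dur_std: float   # standard deviation of phone duration, ms
    en_mean: float   # mean phone energy, dB
    en_std: float    # standard deviation of phone energy, dB


def predict_phone(stats, weight, k_dur, k_en):
    """Predict (duration, energy) for one phone.

    Assumed form: the node mean shifted by the syllable coefficient,
    scaled by the position weight and the node standard deviation.
    """
    duration = stats.dur_mean + weight * k_dur * stats.dur_std
    energy = stats.en_mean + weight * k_en * stats.en_std
    return duration, energy


def fit_coefficient(observed, means, stds, weights):
    """Least-squares coefficient for one syllable, given fixed weights.

    Minimises sum((obs - mean - w * k * std) ** 2) over k, under the
    assumed linear form above.
    """
    num = sum(w * s * (o - m)
              for o, m, s, w in zip(observed, means, stds, weights))
    den = sum((w * s) ** 2 for s, w in zip(stds, weights))
    return num / den if den else 0.0


if __name__ == "__main__":
    # Two phones in one syllable, both 10 ms above their node means.
    k = fit_coefficient([60.0, 90.0], [50.0, 80.0], [10.0, 10.0], [1.0, 1.0])
    stats = NodeStats(dur_mean=50.0, dur_std=10.0, en_mean=60.0, en_std=3.0)
    print(k, predict_phone(stats, weight=0.5, k_dur=2.0, k_en=-1.0))
```

Under these assumptions, an outer optimiser such as simulated annealing would propose a weight vector, refit every syllable's coefficients with `fit_coefficient`, and score the proposal by the resulting prediction error.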