This paper outlines an approach to modelling the dynamics of voice source parameters as observed in the analysis of emotional portrayals by a male speaker of Hiberno-English. The emotions portrayed were happy, angry, sad, bored, and surprised, as well as neutral. The voice source parameters, extracted from emotionally coloured repetitions of a short utterance by means of inverse filtering followed by source model matching, were modelled using classification and regression trees. Regression trees were built on the voice source parameters of the neutral repetition of the same short utterance, in order to transform the voice source parameters from neutral to each of the five emotions. Re-synthesis of emotion-portraying utterances using the transformed voice source parameter dynamics yielded synthetic utterances which listening tests confirmed to represent the targeted emotion categories. The results suggest that adding dynamic voice source information to parametric synthesis of emotion will improve the quality of emotion synthesis.

Following on from a detailed voice source analysis of a small database of male speech [1], in which the speaker portrayed the basic emotions anger, joy, boredom, sadness, and surprise, a number of synthesis implementations and perceptual tests were carried out. The aims were (i) to verify the results of the analysis through synthesis, (ii) to explore whether changing the source parameter settings while maintaining the neutral filter settings would generate emotionally coloured output, and (iii) to propose a method for modelling the dynamics of voice source parameters in an utterance in a way that can be used to generate synthetic utterances with emotional colouring. This method aims to model, for the source parameters, the relationships between the neutral and the emotion-portraying utterances. The information used for the modelling consists of four source parameters based on the analysis of the above-mentioned database (see [1]) as well as information on the relative
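To make the neutral-to-emotion transformation concrete, the following is a minimal sketch of how such a per-emotion mapping could be set up with an off-the-shelf regression tree (scikit-learn's DecisionTreeRegressor standing in for the unbiased regression trees of [4]). The parameter names (EE, RA, RK, RG), array shapes, and synthetic training data are illustrative assumptions, not the paper's actual feature set or data.

```python
# Sketch: learn a frame-by-frame mapping from the source parameters of a
# neutral utterance to those of an emotion-portraying repetition.
# Assumed setup: one row per analysis frame, four LF-model-style source
# parameters per frame (hypothetical ordering: EE, RA, RK, RG).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Stand-in data: source parameter tracks of the neutral repetition.
neutral = rng.random((200, 4))

# Stand-in data: corresponding frames from an "angry" repetition of the
# same utterance (here simulated as a scaled, noisy version of neutral).
angry = neutral * [1.4, 0.8, 1.1, 0.9] + rng.normal(0.0, 0.02, (200, 4))

# One multi-output regression tree per emotion: it learns how the neutral
# source parameter trajectories map to the emotional ones.
tree = DecisionTreeRegressor(max_depth=6)
tree.fit(neutral, angry)

# Transform a fresh neutral parameter track; the predicted trajectories
# would then drive re-synthesis, keeping the neutral filter settings.
new_neutral = rng.random((150, 4))
predicted_angry = tree.predict(new_neutral)
print(predicted_angry.shape)  # (150, 4)
```

In this reading, one such tree is trained per target emotion, and re-synthesis with the predicted parameter tracks is what the listening tests would evaluate; the depth limit and single-tree choice here are arbitrary placeholders.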
[1] D. Klatt et al., "Analysis, synthesis, and perception of voice quality variations among female and male talkers," The Journal of the Acoustical Society of America, 1990.
[2] R. Carlson et al., "Experiments with emotive speech - acted utterances and synthesized replicas," ICSLP, 1992.
[3] W. Sendlmeier et al., "Verification of acoustical correlates of emotional speech using formant-synthesis," 2000.
[4] W. Loh et al., "Regression trees with unbiased variable selection and interaction detection," Statistica Sinica, 2002.
[5] A. Ní Chasaide et al., "The role of voice quality in communicating emotion, mood and attitude," Speech Communication, 2003.
[6] K. Scherer et al., "Vocal expression of affect," 2005.
[7] N. Audibert et al., "Expressive speech synthesis: evaluation of a voice quality centered coder on the different acoustic dimensions," 2006.
[8] A. Ní Chasaide et al., "Time- and amplitude-based voice source correlates of emotional portrayals," ACII, 2007.
[9] J. Liljencrants et al., "A four-parameter model of glottal flow," STL-QPSR (Dept. for Speech, Music and Hearing Quarterly Progress and Status Report), 1985.