Speaker-dependent expression prediction from text: Expressiveness and transplantation

Automatically generating expressive speech from plain text is an important research topic in speech synthesis. Given the same text, different speakers may interpret and read it in very different ways, which implies that expression prediction from text is a speaker-dependent task. Previous work presented an integrated method for expression prediction and speech synthesis that can model the diverse expressions in human speech and build speaker-dependent expression predictors from text. This work extends that integrated method into a framework for speaker and expression factorization. The expressions generated by the speaker-dependent predictors are represented in a shared expression space, within which expressions can be transplanted between different speakers. Experimental results indicate that the proposed method improves the expressiveness of the synthetic speech for different speakers. Furthermore, this work shows how important speaker-specific information is to the performance of an expression predictor from text.
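The transplantation idea above can be illustrated with a minimal sketch: each speaker has their own text-to-expression predictor, but all predictors map into a common expression space, so an expression vector predicted for one speaker can directly condition synthesis in another speaker's voice. The class names, dimensions, and the toy linear predictor below are illustrative assumptions, not the paper's actual models.

```python
# Hypothetical sketch of expression transplantation via a shared
# expression space; all names and shapes are assumptions for
# illustration, not the paper's implementation.
import numpy as np

EXPR_DIM = 4  # dimensionality of the shared expression space (assumed)


class ExpressionPredictor:
    """Speaker-dependent text -> expression-vector mapping (toy linear model)."""

    def __init__(self, vocab_size, seed):
        rng = np.random.default_rng(seed)
        # Each speaker learns their own token-to-expression weights,
        # but the output always lives in the shared EXPR_DIM space.
        self.weights = rng.normal(size=(vocab_size, EXPR_DIM))

    def predict(self, token_ids):
        # Average the per-token expression contributions to get one
        # point in the shared space for the whole utterance.
        return self.weights[token_ids].mean(axis=0)


def synthesize(speaker_id, expression_vec):
    """Stand-in for an expression-conditioned synthesizer: it just
    returns its conditioning inputs instead of producing audio."""
    return {"speaker": speaker_id, "expression": expression_vec}


VOCAB = 100
text = [3, 17, 42]  # toy token ids for an input sentence

# Speaker A's predictor decides how A would read this text...
predictor_a = ExpressionPredictor(VOCAB, seed=1)
expr_a = predictor_a.predict(text)

# ...and because the expression space is shared, that same expression
# vector can condition speaker B's voice: the transplantation step.
out = synthesize("speaker_B", expr_a)
```

The key design point is that factorizing speaker identity from expression is what makes the final line possible: without a shared space, an expression vector learned for one speaker would have no meaning for another.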
