The Theory behind Controllable Expressive Speech Synthesis: A Cross-disciplinary Approach

As part of the Human-Computer Interaction field, expressive speech synthesis is a very rich domain, as it requires knowledge from areas such as machine learning, signal processing, sociology, and psychology. In this Chapter, we focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader is given an overview of the main paradigms used in this field, illustrated through some of its most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We then present a history of the main approaches to Text-to-Speech synthesis: concatenative, parametric, and statistical parametric speech synthesis. Finally, we focus on the latter, including recent techniques that cast Text-to-Speech synthesis as a sequence-to-sequence problem. This formulation enables the use of Deep Learning building blocks such as Convolutional and Recurrent Neural Networks, as well as attention mechanisms. The last part of the Chapter assembles the different aspects of the theory and summarizes the concepts.
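As a concrete illustration of what "encoding speech with audio features" can mean in practice, here is a minimal sketch (not taken from the chapter itself) of one common representation used in statistical parametric and sequence-to-sequence synthesis: a log-power spectrogram computed from a windowed short-time Fourier transform. The frame length, hop size, and FFT size below are typical values for 16 kHz speech, chosen for illustration only.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D waveform into overlapping frames
    (400/160 samples = 25 ms windows with a 10 ms hop at 16 kHz)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def log_power_spectrogram(x, frame_len=400, hop=160, n_fft=512):
    """Hann-windowed STFT magnitude squared, on a log scale:
    one feature vector of n_fft // 2 + 1 frequency bins per frame."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    return np.log(spec + 1e-10)  # floor avoids log(0)

# Example: 1 second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
S = log_power_spectrogram(wave)  # shape: (frames, frequency bins)
```

A real synthesis pipeline would typically go further (e.g. mel-scale filtering or cepstral analysis), but the idea is the same: the waveform becomes a sequence of fixed-size feature vectors that a statistical model can predict frame by frame.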
