Emotional Speech Datasets for English Speech Synthesis Purpose: A Review

In this paper, we review publicly available emotional speech datasets and their usability for state-of-the-art speech synthesis. This usability is conditioned by several characteristics of these datasets: the quality of the recordings, the quantity of the data, and the emotional content captured in the data. We then present a dataset that was recorded based on the needs observed in this area. It contains data for male and female actors in English and a male actor in French. The database covers five emotion classes, so it could be suitable for building synthesis and voice transformation systems with the potential to control the emotional dimension.
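As an illustration of how the quantity and emotional-coverage criteria can be checked in practice, here is a minimal sketch (standard library only) that tallies recorded hours per emotion class from a corpus metadata file. The file name and its columns ("emotion", "duration_s") are assumptions for illustration, not part of the review; real corpora such as IEMOCAP, CREMA-D, or RAVDESS each ship their own metadata layout.

```python
import csv
from collections import defaultdict

def hours_per_emotion(metadata_csv: str) -> dict:
    """Sum recorded speech (in hours) for each emotion label in the corpus."""
    totals = defaultdict(float)
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Hypothetical columns: "emotion" (class label), "duration_s" (clip length).
            totals[row["emotion"]] += float(row["duration_s"]) / 3600.0
    return dict(totals)

if __name__ == "__main__":
    # Sparse classes may only support adaptation or voice transformation
    # rather than training an emotion-controllable TTS voice from scratch.
    for emotion, hours in sorted(hours_per_emotion("metadata.csv").items()):
        print(f"{emotion:>10}: {hours:.1f} h")
```

A per-class breakdown like this makes imbalances visible at a glance, which is exactly the kind of quantity consideration the review raises when judging a dataset's fitness for synthesis.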
