Expressive Speech Synthesis: Past, Present, and Possible Futures

Approaches to adding expressivity to synthetic speech have changed considerably over the last 20 years. Early systems, including formant and diphone systems, were built around “explicit control” models; early unit selection systems adopted a “playback” approach. Currently, various approaches are being pursued to increase flexibility of expression while maintaining the quality of state-of-the-art systems, among them a new “implicit control” paradigm in statistical parametric speech synthesis, which provides control over expressivity by combining and interpolating between statistical models trained on different expressive databases. This chapter provides an overview of past and present approaches and ventures a look into possible future developments.
