Speech Synthesis Systems in Ambient Intelligence Environments

Publisher Summary

This chapter discusses the state of the art in speech synthesis systems and the components needed to give them ambient intelligence (AmI) characteristics. Spoken interaction is probably the most effective means of human communication. Speech is an essential human characteristic that sets us apart from other species; it has evolved to become extremely flexible and variable, and consequently very complex. A traditional speech synthesis system consists of four major components: a text generator, a text processor, a speech unit generator, and a prosody generator. A speech synthesis system that talks to the user is an example of direct communication, which can take place in many settings and for various purposes, such as alerting, informing, answering, entertaining, and educating. The conditions under which such services are provided can vary widely. Users, too, naturally vary over time and differ in sex, age, education, experience, culture, scientific and emotional intelligence, needs, wealth, and so forth. This chapter summarizes the state of current speech synthesis technology, outlines its essential strengths and limitations, and projects future opportunities in the context of human-centric AmI interfaces. The goal is speech synthesis systems in AmI environments that can produce the right speech at the right place and time, for the right person. This is a highly challenging task that requires multidisciplinary research across several fields.
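To make the four-component decomposition above concrete, the following is a minimal Python sketch of such a pipeline. All class names, method names, and the toy processing rules are hypothetical illustrations, and the staging order (prosody assigned before unit generation) is an assumption; the chapter does not prescribe this interface or any particular toolkit.

# Minimal sketch of the four-stage pipeline named in the summary:
# text generator -> text processor -> prosody generator -> speech unit
# generator. All names and rules here are hypothetical illustrations,
# not an API from the chapter or from any specific TTS system.

from dataclasses import dataclass
from typing import List


@dataclass
class ProsodyTarget:
    """Per-phone prosodic targets: duration in seconds, mean F0 in Hz."""
    phone: str
    duration: float
    f0: float


class TextGenerator:
    """Produces the message to be spoken (e.g., from an application event)."""
    def generate(self, event: str) -> str:
        return f"Reminder: {event}."


class TextProcessor:
    """Normalizes text and converts it to a phone sequence (toy rule only)."""
    def to_phones(self, text: str) -> List[str]:
        # A real front end would expand numbers and abbreviations and use a
        # pronunciation lexicon; here we fake one phone per letter.
        return [c for c in text.lower() if c.isalpha()]


class ProsodyGenerator:
    """Assigns duration and F0 targets to each phone (flat toy contour)."""
    def assign(self, phones: List[str]) -> List[ProsodyTarget]:
        return [ProsodyTarget(p, duration=0.08, f0=120.0) for p in phones]


class SpeechUnitGenerator:
    """Selects and concatenates acoustic units matching the targets."""
    def synthesize(self, targets: List[ProsodyTarget]) -> List[float]:
        # Stand-in for waveform generation: one duration value per phone.
        return [t.duration for t in targets]


if __name__ == "__main__":
    text = TextGenerator().generate("take your medication at 9 am")
    phones = TextProcessor().to_phones(text)
    targets = ProsodyGenerator().assign(phones)
    waveform = SpeechUnitGenerator().synthesize(targets)
    print(f"{len(phones)} phones, total duration {sum(waveform):.2f} s")

An AmI-oriented system would condition each stage on context (who the listener is, where, and when), which is what distinguishes the chapter's goal from this fixed pipeline.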
