Adaptive Speech Synthesis in a Cognitive Robotic Service Apartment: An Overview and First Steps Towards Voice Selection

The Cognitive Robotic Service Apartment is both a realistic apartment and a laboratory environment in which one or several users interact with various manifestations of an intelligent agent, e.g. a talking head. We expect that across the various situational settings in the apartment, different specifications and adaptations of the synthetic voice will become necessary. Some of the dynamic adaptations will depend on physical factors, e.g. ambient noise affecting speech intelligibility, others on interpersonal factors, e.g. familiarity, and yet others on the manifestation of the artificial agent itself, e.g. the agent's voice, perceived gender, age and competence. The overall aim of our ongoing project is to build a voice for dynamic speech synthesis adaptation across various typical interaction scenarios and agent manifestations (robot, virtual agent). In the final implementation, the voice adaptation will be realized incrementally, i.e. the adaptation will take effect while the agent is talking. The adaptive synthesis module will extend the existing incremental speech processing system InproTK, which is part of the cognitive architecture of the apartment. In order to determine an ideal set of adaptive parameters, a series of experiments is currently being planned and carried out. The paper presents our general methodology and describes our first study to find suitable synthesis voices for the virtual agent and humanoid robot used in the Cognitive Robotic Service Apartment.

1. Modeling Adaptive Speech Synthesis in the CSRA

Modeling the interaction between humans and machines remains a major challenge for speech technology. This affects not only the interfaces between the different interacting system components (ASR, NLU, dialogue model, NLG, TTS) but also each component individually.
Our present project focuses on the improvement of a speech synthesis component in an interactive system in general, and on the situation-specific adaptation and modification of the synthetic speech output in particular. Such adaptations of the voice, driven by communicative purposes, are natural in humans and necessary in machines mimicking human speech communication. The present paper outlines how dynamically adaptive synthetic speech is realized in an ongoing research project as part of a complex interaction environment called the Cognitive Service Robotic Apartment (CSRA).

1.1 Situation-specific adaptation in human speech production

An everyday example of situation-specific human speech adaptation has become famous as the Lombard Effect: quite often, dialogues between humans take place in noisy environments (outdoors in the presence of traffic noise; indoors with background music or with several people chatting simultaneously, e.g. in a pub). These conditions impede the intelligibility of the spoken content, both through limited transmission quality and through the speakers' limited ability to self-monitor their voices. E. Lombard was the first researcher to discover adaptation processes in speech produced under noisy conditions, and his findings initiated a great deal of subsequent research in this area [9]. His main observation was that self-monitoring is the regulator between speech production and perception, and that a lack of self-monitoring leads to an involuntary adaptation to the environmental conditions, i.e. to Lombard Speech. Many studies have investigated the Lombard Effect from a medical or psychological perspective, but more recently it has also been investigated from acoustic, phonetic, linguistic and speech technological perspectives.
These studies showed that, compared to speech in a quiet environment, Lombard Speech exhibits a decreased speaking rate, an increased fundamental frequency (F0) and F0 range, a shift of intensity from low to high frequencies, an increased vowel duration and a shift of F1 and F2 [6, 10]. However, the identified differences depend both on the speaker and on the amount and type of ambient noise [7]. Lombard Speech adaptations occur spontaneously, immediately and unintentionally, and thus have a different cause than phonetically similar but intended adaptations, such as speech addressed to an inattentive listener, a distant listener, a poor ASR system, a listener with hearing problems, or a listener unaware of a potential danger. So far, very little is known about Lombard Speech under real-life communicative conditions, as it has mostly been investigated in monologue reading tasks. Still, it can safely be assumed that human-human communication profits from Lombard Speech, as its adaptations serve to improve intelligibility [6, 11]. Therefore, although we cannot know precisely whether intended adaptations made for the sake of improved intelligibility resemble Lombard Speech in all its facets, we make this simplifying assumption in our ongoing study.

1.2 Adaptive interaction in the CSRA

Our project's interaction architecture is a Cognitive Robotic Service Apartment (CSRA). Unlike typical speech synthesis evaluations, this setting enables us to evaluate our adaptation strategies under both real-life and laboratory conditions. The former is possible as the human-apartment interactions are monitored permanently and across a wide range of everyday "university lab" situations such as demo tours, meetings or lunchtime chats, in the course of which individuals or groups interact with the interactive components both verbally and non-verbally.
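As a rough illustration of the Lombard adaptations summarized in Section 1.1, the following sketch scales a small set of prosodic parameters towards Lombard Speech as ambient noise increases. The parameter names, scaling factors and the 40-70 dB interpolation range are purely illustrative assumptions, not measured values from the literature.

```python
# Hypothetical sketch: shift neutral prosodic parameters towards
# Lombard Speech. All scaling factors are illustrative assumptions.

def lombard_adapt(params, noise_level_db):
    """Scale neutral prosodic parameters as a function of ambient noise.

    params: dict with 'f0_hz', 'rate_syll_per_s', 'vowel_dur_s'
    noise_level_db: ambient noise level; adaptation grows with noise.
    """
    # Interpolate between no adaptation (quiet, <= 40 dB) and full
    # adaptation (loud, >= 70 dB); the range is an assumption.
    strength = min(max((noise_level_db - 40.0) / 30.0, 0.0), 1.0)
    return {
        # F0 and vowel duration increase, speaking rate decreases,
        # mirroring the Lombard correlates reported above.
        "f0_hz": params["f0_hz"] * (1.0 + 0.20 * strength),
        "rate_syll_per_s": params["rate_syll_per_s"] * (1.0 - 0.15 * strength),
        "vowel_dur_s": params["vowel_dur_s"] * (1.0 + 0.25 * strength),
    }

neutral = {"f0_hz": 120.0, "rate_syll_per_s": 5.0, "vowel_dur_s": 0.08}
adapted = lombard_adapt(neutral, noise_level_db=70.0)
```

In a real system, the adaptation strength would additionally depend on the speaker and the type of noise, as noted above [7].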
The verbal interactions will use different manifestations of intelligent agents, such as a humanoid robot, a virtual agent or a disembodied apartment voice. Therefore, the agent's interaction strategy should suit various settings (information, service, interaction with a group, interaction with an individual, formal/informal settings) and their concrete manifestations (background music, quiet environment, attentive/inattentive user). We assume that the perceived interaction quality is, at least to some extent, influenced by the agent's overall voice quality and design, as these factors are associated with characteristics such as perceived competence, trustworthiness, dominance, anxiety, reliability or credibility. Therefore, in a first step, a set of suitable voices and designs working across various types of artificial agents and situations needs to be determined. It is possible that the suitability of a voice is to some extent situation-dependent; e.g. it might be more important to have a "competent"-sounding voice in a formal situation where the agent explains something, while a "friendly, warm" voice might be more important in an informal situation where the agent welcomes the user.

1.3 Modeling adaptation in synthetic speech

In contrast to speech recognition systems [10, 8], the realization of Lombard Speech or similar types of environmental adaptation in speech synthesis is hitherto not well understood. This comes somewhat as a surprise, as such adaptations can be expected to improve both intelligibility and perceived naturalness. Two potential adaptive strategies can be identified. One approach is to train an artificial voice with a different speaking style, e.g. on a Lombard Speech corpus recorded in a noisy environment. Such methods produce speaker-dependent synthetic voices and require a large amount of training data [15]. Another strategy lies in the modification of an existing 'neutrally speaking' voice.
Such adaptations are achieved via the modification of extracted speech parameters, such as F0, energy or spectral characteristics, and a subsequent re-synthesis. One advantage of this solution is that no new training data are required. More importantly, such an adaptation can be performed dynamically, quickly and incrementally, without the need to switch to a different "voice". Such a dynamic, incremental type of adaptation to the situational needs models the automaticity of the Lombard Effect in humans (cf. above) and may therefore significantly contribute to the perceived naturalness of the resulting interaction, as has previously been shown for other aspects of verbal interaction in human-machine dialogues (cf. below). In order to objectively assess the intelligibility of the synthetic speech thus modified, several solutions have been proposed in the literature, mainly based on models of the human auditory system (Glimpse Proportion, Dau model) and relying on the signal-to-noise ratio (SNR) [4].

1.4 Adaptation as part of incremental speech processing

In order to realize the situation-specific adaptations, the synthetic speech is generated within the speech-processing toolkit InproTK as part of the cognitive architecture of the CSRA apartment [1]. It includes a speech recognition module and a speech synthesis module and manages the speech input and output for the human-machine communication together with the dialogue management tool Pamini [9]. Our speech adaptation module is based on the speech synthesis module, using a modified version of the MaryTTS synthesizer [13]. This modification of the internal data structures was necessary to support the incremental processing offered by InproTK: incremental speech processing means that the system can react just-in-time to situational changes in speech, e.g. disfluencies, interruptions or other environmental changes, both on the side of speech recognition and of speech synthesis. This is achieved by a step-by-step, bottom-up process.
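The SNR-based intelligibility assessment mentioned in Section 1.3 can be illustrated with a deliberately simplified, glimpse-like proxy: the fraction of frames whose local SNR exceeds a threshold. Note that the actual Glimpse Proportion measure operates on time-frequency regions after auditory filtering [4]; the frame-wise version below is only a sketch under that simplification.

```python
import math

# Simplified sketch of a glimpse-like intelligibility proxy. The real
# Glimpse Proportion works on time-frequency bins after auditory
# filtering; here only frame-wise energies are used, as an illustration.

def frame_snr_db(speech_frames, noise_frames):
    """Per-frame SNR in dB from parallel lists of frame energies."""
    eps = 1e-12  # guard against log(0)
    return [10.0 * math.log10((s + eps) / (n + eps))
            for s, n in zip(speech_frames, noise_frames)]

def glimpse_proportion(speech_frames, noise_frames, threshold_db=3.0):
    """Fraction of frames whose local SNR exceeds the threshold."""
    snrs = frame_snr_db(speech_frames, noise_frames)
    return sum(1 for snr in snrs if snr > threshold_db) / len(snrs)

speech = [1.0, 4.0, 0.5, 8.0]   # frame energies (arbitrary units)
noise = [1.0, 1.0, 1.0, 1.0]
gp = glimpse_proportion(speech, noise)
```

Such a measure lets a modification strategy be scored against a noise recording before any listening test is run.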
Each utterance is split into chunks (Incremental Units), which can be phonemes, words or an entire phrase, before being processed further. For any type of adaptation, this functionality is highly suitable because it allows prosodic changes, such as changes of intensity or loudness, in the course of the synthesis process. Many conventional text-to-speech systems are based on the sequential processing of utterances: before the next sentence is processed, the previous sentence is synthesized completely. Such a traditional architecture allows adaptation only on a full-utterance basis.
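The contrast between full-utterance and incremental processing can be sketched as follows (with hypothetical names; this is not InproTK's actual API): each Incremental Unit is adapted just before delivery, so a change in ambient conditions can take effect mid-utterance instead of only at the next full sentence.

```python
# Minimal sketch of incremental, per-chunk adaptation. Names are
# hypothetical; InproTK's real interfaces differ.

def incremental_synthesis(chunks, get_gain_db):
    """Yield (chunk, gain) pairs, re-querying the current gain per chunk."""
    for chunk in chunks:
        # The gain is read anew for every Incremental Unit, allowing a
        # just-in-time reaction to environmental changes mid-utterance.
        yield chunk, get_gain_db()

gains = iter([0.0, 0.0, 6.0])        # noise rises after the second chunk
units = ["please", "close the", "window"]
out = list(incremental_synthesis(units, lambda: next(gains)))
```

In a full-utterance architecture, by contrast, the gain would be sampled once per sentence, so the rise in noise would only affect the following utterance.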

[1] M. Schröder et al., "The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching," Int. J. Speech Technol., 2003.

[2] M. Zhang (ed.), Proceedings of the ACL 2012 System Demonstrations, 2012.

[3] F. Schiel et al., "The Lombard Effect in Spontaneous Dialog Speech," INTERSPEECH, 2011.

[4] D. Schlangen et al., "The InproTK 2012 release," SDCTD@NAACL-HLT, 2012.

[5] N. Henrich et al., "Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?," Comput. Speech Lang., 2014.

[6] P. Alku et al., "Lombard modified text-to-speech synthesis for improved intelligibility: submission for the Hurricane Challenge 2013," INTERSPEECH, 2013.

[7] D. Vlaj et al., "The Influence of Lombard Effect on Speech Recognition," 2011.

[8] T. Kobayashi et al., "Analysis of Speaker Adaptation Algorithms for HMM-Based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm," IEEE Transactions on Audio, Speech, and Language Processing, 2009.

[9] Y. Stylianou et al., "Evaluating the intelligibility benefit of speech modifications in known noise conditions," Speech Commun., 2013.

[10] S. King et al., "Can Objective Measures Predict the Intelligibility of Modified HMM-Based Synthetic Speech in Noise?," INTERSPEECH, 2011.

[11] D. Pearce et al., "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," INTERSPEECH, 2000.

[12] D. Schlangen et al., "INPRO_iSS: A Component for Just-In-Time Incremental Speech Synthesis," ACL System Demonstrations, 2012.

[13] P. Lau, "The Lombard Effect as a Communicative Phenomenon," 2008.

[14] M. Cooke et al., "Speech production modifications produced by competing talkers, babble, and stationary noise," J. Acoust. Soc. Am., 2008.

[15] B. Wrede et al., "Pamini: A framework for assembling mixed-initiative human-robot interaction from generic interaction patterns," SIGDIAL Conference, 2010.